[00:00:24] same here, going afk. need to drive [00:01:45] (03CR) 10Krinkle: "Note to self: Confirm with Joe that using this directly cross-dc is fine when we're multi-dc, incl w.r.t. gutter pool, and w.r.t. hashing " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809326 (https://phabricator.wikimedia.org/T278392) (owner: 10Krinkle) [00:05:02] RECOVERY - Disk space on labweb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=labweb1001&var-datasource=eqiad+prometheus/ops [00:05:14] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:06:48] RECOVERY - Disk space on labweb1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=labweb1002&var-datasource=eqiad+prometheus/ops [00:10:10] PROBLEM - Check systemd state on es2033 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:17:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2003-dev.codfw.wmnet with OS bullseye [00:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:05] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudgw20... 
[00:19:33] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10Papaul) [00:20:45] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10Papaul) 05Open→03Resolved @Andrew thanks for getting me the partman recipe info. This is complete. [00:28:48] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-bscarone-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:33:20] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:41:37] 10SRE, 10ops-eqiad: cloudstore1008 - eno2 reporting no carrier - https://phabricator.wikimedia.org/T309885 (10wiki_willy) a:05AndrewBonamici→03aborrero [00:50:26] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36106/console" [puppet] - 10https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [00:55:25] (03PS3) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) [00:56:02] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36107/console" [puppet] - 10https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [01:00:02] (03CR) 10Ssingh: [V: 03+1] "Not sure why PCC doesn't show the changed file for 
modules/bird/files/prometheus-bird-exporter.default, but well, that's the latest change" [puppet] - 10https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [01:05:48] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:14:00] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:14:28] (03PS1) 10Ssingh: admin: allow sudo for jclark-ctr for cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/809338 (https://phabricator.wikimedia.org/T306654) [01:34:18] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:34:22] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:41:30] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:41:58] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:01:48] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 110 probes of 679 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:01:58] 
RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:05:52] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:07:02] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 69 probes of 679 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:14:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:29:24] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 116 probes of 681 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:32:41] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:33:13] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 115 probes of 672 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:38:11] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:38:27] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 90 probes of 672 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:42:21] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:51:07] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 57 probes of 681 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [03:05:55] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:20:13] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:24:43] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.097 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:32:41] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:37:21] 10SRE, 10ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [03:37:33] (03CR) 10Andrea Denisse: [C: 03+2] loki: add loki as an optional grafana component [puppet] - 10https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [03:41:53] PROBLEM - Check systemd state on netbox1002 is CRITICAL: 
CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:30] (03CR) 10Tim Starling: [C: 03+2] Move $wgCentralAuthTokenCacheType from redis_local to mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: 10Aaron Schulz) [04:32:21] (03Merged) 10jenkins-bot: Move $wgCentralAuthTokenCacheType from redis_local to mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683465 (https://phabricator.wikimedia.org/T278392) (owner: 10Aaron Schulz) [04:32:26] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:36:12] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648 [04:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:36:19] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [04:37:25] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: wgCentralAuthTokenCacheType -> mcrouter T278392 (duration: 03m 44s) [04:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:30] T278392: Storage solution for cross-datacenter tokens - https://phabricator.wikimedia.org/T278392 [04:37:54] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:39:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [04:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:33] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [04:40:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [04:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:18] 10SRE, 10Traffic, 10Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10AndyRussG) Heyy thanks so much for all the work on this!!! just a few notes here from a super uninformed perspective, just on the off chance they might be useful... In... [05:18:37] 10SRE, 10ops-eqiad, 10DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (10Marostegui) Excellent, can you let me know a day and time that works for you to replace it? 
I can leave the host offline for you [05:33:34] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:34:00] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:35:30] RECOVERY - MariaDB read only es2 on es2033 is OK: Version 10.4.25-MariaDB-log, Uptime 38s, read_only: True, event_scheduler: True, 10.97 QPS, connection latency: 0.004945s, query latency: 0.068268s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [05:36:28] RECOVERY - mysqld processes on es2033 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [05:40:40] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:46:41] 10SRE, 10ops-codfw, 10DBA: es2033 crashed at Jun 28 ~15:34 - https://phabricator.wikimedia.org/T311526 (10Marostegui) Started a data check run [05:56:19] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648 [05:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:27] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [05:59:46] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:01:07] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate 
mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) Thanks for those graphs Amir! Let me know today once you are around, I want to repeat the test on db1132 (10.6) changin... [06:02:36] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648 [06:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:42] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [06:04:11] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648 [06:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:41] (03PS1) 10Marostegui: db21[53-74].yaml: Add files [puppet] - 10https://gerrit.wikimedia.org/r/809479 (https://phabricator.wikimedia.org/T311493) [06:06:40] (03CR) 10Marostegui: [C: 03+2] db21[53-74].yaml: Add files [puppet] - 10https://gerrit.wikimedia.org/r/809479 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [06:09:50] (03PS1) 10Marostegui: db21[53-74].yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/809480 (https://phabricator.wikimedia.org/T311493) [06:10:33] (03CR) 10Marostegui: [C: 03+2] db21[53-74].yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/809480 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [06:12:57] 10SRE, 10RESTBase-API, 10Traffic, 10Documentation: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (10Mitar) Hm, I am pretty sure that I am doing rate limiting correctly on my side, but I am hitting 429s 
after a brief time when trying to do 1000/10s rate limit to... [06:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:29:00] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:33:42] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:40:22] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:24] (03CR) 10Slyngshede: [C: 03+1] "Looks good. Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/809181 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [06:46:27] (03CR) 10Slyngshede: [C: 03+2] logster: remove absented logster- cron [puppet] - 10https://gerrit.wikimedia.org/r/809181 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [06:46:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P30597 and previous config saved to /var/cache/conftool/dbconfig/20220629-064655-root.json [06:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:01] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
Nicely done, see inline for two non-blocking nits" [puppet] - 10https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [06:57:07] 10ops-codfw, 10decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (10Marostegui) [06:57:30] 10ops-codfw, 10decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (10Marostegui) [06:57:42] 10SRE, 10Traffic: pontoon.traffic.eqiad1.wikimedia.cloud unable to run puppet agent due to certificate mismatch - https://phabricator.wikimedia.org/T310303 (10fgiunchedi) Keeping the instances SGTM @BCornwall, thanks for looking into it. Personally I'd recommend starting afresh with a Pontoon stack (i.e. keep... [06:57:47] (03CR) 10Ayounsi: "No idea about prometheus-bird-exporter.default either." [puppet] - 10https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [06:58:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2071 T311589', diff saved to https://phabricator.wikimedia.org/P30598 and previous config saved to /var/cache/conftool/dbconfig/20220629-065804-root.json [06:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:10] T311589: decommission db2071 - https://phabricator.wikimedia.org/T311589 [07:00:05] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220629T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:04:07] 10ops-codfw, 10decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (10Marostegui) [07:04:34] (03PS1) 10Marostegui: mariadb: Decommission db2071 [puppet] - 10https://gerrit.wikimedia.org/r/809526 (https://phabricator.wikimedia.org/T311589) [07:05:40] !log re-enabled bgp to telia in eqsin [07:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2071.codfw.wmnet [07:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:32] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2071 [puppet] - 10https://gerrit.wikimedia.org/r/809526 (https://phabricator.wikimedia.org/T311589) (owner: 10Marostegui) [07:08:57] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2071 - https://phabricator.wikimedia.org/T311589 (10Marostegui) [07:10:42] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [07:10:45] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2071 - https://phabricator.wikimedia.org/T311589 (10Marostegui) [07:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/809338 (https://phabricator.wikimedia.org/T306654) (owner: 10Ssingh) [07:17:47] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db2071.codfw.wmnet [07:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:51] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2071 - https://phabricator.wikimedia.org/T311589 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission 
executed by marostegui@cumin1001 for hosts: `db2071.codfw.wmnet` - db2071.codfw.wmnet (**FAIL**) - Downtimed host on Icinga/Alert... [07:18:48] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2071 - https://phabricator.wikimedia.org/T311589 (10Marostegui) a:03Papaul [07:19:27] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2071 - https://phabricator.wikimedia.org/T311589 (10Marostegui) @Papaul this is ready for you. Please note the failure above, make sure to wipe the disks yourself. [07:20:46] (03PS1) 10Ayounsi: Revert "eqsin: disable Telia transit" [homer/public] - 10https://gerrit.wikimedia.org/r/809353 [07:22:51] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10ayounsi) Looks like there are no more errors. @robh could you check it one last time before replying to Telia? `cr3-eqsin> show interfaces xe-0/1/1 extensive | match error [...] # Everything should show 0, es... [07:24:42] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts webperf1002.eqiad.wmnet [07:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:52] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [07:27:04] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [07:27:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2071 from dbctl', diff saved to https://phabricator.wikimedia.org/P30600 and previous config saved to /var/cache/conftool/dbconfig/20220629-072753-marostegui.json [07:27:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:01] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [07:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:55] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [07:32:07] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [07:34:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:34:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts webperf1002.eqiad.wmnet [07:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:18] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `webperf1002.eqiad.wmnet` - webperf1002.eqiad.wmnet (**PASS**) - Downtimed host on Icinga... 
[07:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:39] 10ops-codfw, 10decommission-hardware: decommission db2075 - https://phabricator.wikimedia.org/T311591 (10Marostegui) [07:35:08] 10ops-codfw, 10decommission-hardware: decommission db2075 - https://phabricator.wikimedia.org/T311591 (10Marostegui) [07:35:59] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff) [07:37:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2075 T311591', diff saved to https://phabricator.wikimedia.org/P30601 and previous config saved to /var/cache/conftool/dbconfig/20220629-073722-root.json [07:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:29] T311591: decommission db2075 - https://phabricator.wikimedia.org/T311591 [07:37:58] (03PS4) 10Urbanecm: GrowthExperiments: Remove unused GEHomepageSuggestedEditsRequiresOptIn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791302 (https://phabricator.wikimedia.org/T308208) (owner: 10Kosta Harlan) [07:38:01] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Remove unused GEHomepageSuggestedEditsRequiresOptIn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791302 (https://phabricator.wikimedia.org/T308208) (owner: 10Kosta Harlan) [07:38:11] (03PS3) 10Urbanecm: GrowthExperiments: Remove GEHomepageSuggestedEditsTopicsRequiresOptIn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791303 (https://phabricator.wikimedia.org/T308209) (owner: 10Kosta Harlan) [07:38:13] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Remove GEHomepageSuggestedEditsTopicsRequiresOptIn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791303 (https://phabricator.wikimedia.org/T308209) (owner: 10Kosta Harlan) [07:38:16] (03PS1) 10Marostegui: instances.yaml: Remove db2075 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/809528 
(https://phabricator.wikimedia.org/T311591) [07:38:56] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2075 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/809528 (https://phabricator.wikimedia.org/T311591) (owner: 10Marostegui) [07:39:19] (03Merged) 10jenkins-bot: GrowthExperiments: Remove unused GEHomepageSuggestedEditsRequiresOptIn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791302 (https://phabricator.wikimedia.org/T308208) (owner: 10Kosta Harlan) [07:39:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2075 from dbctl T311591', diff saved to https://phabricator.wikimedia.org/P30602 and previous config saved to /var/cache/conftool/dbconfig/20220629-073919-root.json [07:39:23] (03Merged) 10jenkins-bot: GrowthExperiments: Remove GEHomepageSuggestedEditsTopicsRequiresOptIn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791303 (https://phabricator.wikimedia.org/T308209) (owner: 10Kosta Harlan) [07:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:10] (03CR) 10Ayounsi: [C: 03+2] Revert "eqsin: disable Telia transit" [homer/public] - 10https://gerrit.wikimedia.org/r/809353 (owner: 10Ayounsi) [07:40:25] !log dbmaint s1@codfw T311475 [07:40:26] !log dbmaint s@codfw T311475 [07:40:29] !log dbmaint s5@codfw T311475 [07:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:30] T311475: Decommission db[2071-2092] - https://phabricator.wikimedia.org/T311475 [07:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:00] (03Merged) 10jenkins-bot: Revert "eqsin: disable Telia transit" [homer/public] - 10https://gerrit.wikimedia.org/r/809353 (owner: 10Ayounsi) [07:43:08] (03PS1) 10Marostegui: mariadb: Remove db2075 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/809531 (https://phabricator.wikimedia.org/T311591) [07:43:15] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2075.codfw.wmnet [07:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:12] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2075 - https://phabricator.wikimedia.org/T311591 (10Marostegui) [07:44:36] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete [07:45:25] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 143c3fd: d5afd97: Remove unused GEHomepageSuggestedEditsRequiresOptIn and GEHomepageSuggestedEditsTopicsRequiresOptIn (T308209, T308208) (duration: 03m 22s) [07:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:32] T308209: Remove GEHomepageSuggestedEditsTopicsRequiresOptIn - https://phabricator.wikimedia.org/T308209 [07:45:33] T308208: Remove GEHomepageSuggestedEditsRequiresOptIn - https://phabricator.wikimedia.org/T308208 [07:46:12] (03PS2) 10Urbanecm: Remove wgGEMentorDashboardBetaMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808263 [07:46:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:17] (03CR) 10Urbanecm: [C: 03+2] Remove wgGEMentorDashboardBetaMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808263 (owner: 10Urbanecm) [07:46:29] (03PS2) 10Urbanecm: [beta] Remove wgGEMentorDashboardDiscoveryEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808264 [07:46:32] (03CR) 10Urbanecm: [C: 03+2] [beta] Remove wgGEMentorDashboardDiscoveryEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808264 (owner: 10Urbanecm) [07:46:57] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [07:47:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:03] (03Merged) 10jenkins-bot: Remove wgGEMentorDashboardBetaMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808263 (owner: 10Urbanecm) [07:47:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:47:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:18] (03Merged) 10jenkins-bot: [beta] Remove wgGEMentorDashboardDiscoveryEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808264 (owner: 10Urbanecm) [07:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:47] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099) (owner: 10MewOphaswongse) [07:48:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:26] (03PS2) 10Urbanecm: Add GEMentorProvider to configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808267 (https://phabricator.wikimedia.org/T310905) [07:48:30] (03CR) 10Urbanecm: [C: 03+2] Add GEMentorProvider to configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808267 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [07:48:36] (03PS2) 10Urbanecm: [beta] Growth: Enable structured mentor list at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808268 (https://phabricator.wikimedia.org/T310905) [07:48:42] (03CR) 10Urbanecm: [C: 03+2] [beta] Growth: Enable structured mentor list at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808268 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [07:49:18] (03Merged) 10jenkins-bot: Add GEMentorProvider to configuration [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/808267 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [07:50:47] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove db2075 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/809531 (https://phabricator.wikimedia.org/T311591) (owner: 10Marostegui) [07:50:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:18] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2075 - https://phabricator.wikimedia.org/T311591 (10Marostegui) [07:51:47] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1d1b9cf: Remove wgGEMentorDashboardBetaMode (duration: 03m 34s) [07:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:06] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db2075.codfw.wmnet [07:54:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:54:11] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [07:54:11] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2075 - https://phabricator.wikimedia.org/T311591 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2075.codfw.wmnet` - db2075.codfw.wmnet (**FAIL**) - Downtimed host on Icinga/Alert... [07:54:14] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[07:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:26] ops-codfw, decommission-hardware, Patch-For-Review: decommission db2075 - https://phabricator.wikimedia.org/T311591 (Marostegui) [07:54:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:24] ops-codfw, decommission-hardware, Patch-For-Review: decommission db2075 - https://phabricator.wikimedia.org/T311591 (Marostegui) a:Papaul @Papaul this is ready for you. Please note the failure above, make sure to wipe the disks yourself. [07:55:50] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5a583804: Add GEMentorProvider to configuration (T310905) (duration: 03m 40s) [07:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:56] T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905 [07:59:04] * urbanecm done [08:00:05] dduvall and hashar: Dear deployers, time to do the MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220629T0800). 
[08:00:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:00:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:28] (PS1) Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - https://gerrit.wikimedia.org/r/809533 [08:06:04] (CR) CI reject: [V: -1] WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - https://gerrit.wikimedia.org/r/809533 (owner: Slyngshede) [08:10:20] (PS2) Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - https://gerrit.wikimedia.org/r/809533 [08:12:11] SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (MoritzMuehlenhoff) Can't we just import the Cassandra 4 debs and use those? The work needs to happen at some point anyway and it's a fresh cluster. Buster is almost three years old, going into LT... 
[08:13:39] (PS1) Elukey: role::ml_k8s::worker::staging: add calico-cni config [puppet] - https://gerrit.wikimedia.org/r/809534 (https://phabricator.wikimedia.org/T302195) [08:14:54] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36110/console" [puppet] - https://gerrit.wikimedia.org/r/809534 (https://phabricator.wikimedia.org/T302195) (owner: Elukey) [08:15:21] (CR) Elukey: [V: +1 C: +2] role::ml_k8s::worker::staging: add calico-cni config [puppet] - https://gerrit.wikimedia.org/r/809534 (https://phabricator.wikimedia.org/T302195) (owner: Elukey) [08:18:43] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36109/console" [puppet] - https://gerrit.wikimedia.org/r/809533 (owner: Slyngshede) [08:25:27] SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Engineering-Kanban: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (BTullis) I'm going to try updating the RAID controller firmware, then the BIOS on stat1010, to see if either of these fixes the drive ordering issue.... 
[08:30:11] SRE-tools, Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (Marostegui) [08:31:15] (PS1) Filippo Giunchedi: prometheus: add initial blackbox dns probes for wikipedia [puppet] - https://gerrit.wikimedia.org/r/809535 (https://phabricator.wikimedia.org/T169860) [08:31:17] (PS1) Filippo Giunchedi: prometheus: probe DNS for (www).wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/809536 (https://phabricator.wikimedia.org/T169860) [08:31:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet [08:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:08] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:32:12] (CR) CI reject: [V: -1] prometheus: probe DNS for (www).wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/809536 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi) [08:39:15] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:39:47] SRE-tools, Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (MoritzMuehlenhoff) Maybe we need run "swapoff -a" prior to the wipefs call? 
[08:43:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet [08:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:42] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on idp-test1002.wikimedia.org with reason: webauthn tests [08:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on idp-test1002.wikimedia.org with reason: webauthn tests [08:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:06] SRE-tools, Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (Marostegui) I have a few more hosts to decommission. I can try to do so, but we'd not know whether it helped or it would have just worked without it too :) Up t... 
[08:55:09] SRE, ops-eqiad, DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (Jclark-ctr) I can do it today if you can offline it [08:58:29] SRE, ops-eqiad: cloudstore1008 - eno2 reporting no carrier - https://phabricator.wikimedia.org/T309885 (Jclark-ctr) a:aborrero→Andrew [08:58:54] (PS3) Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - https://gerrit.wikimedia.org/r/809533 [08:59:46] (CR) CI reject: [V: -1] WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - https://gerrit.wikimedia.org/r/809533 (owner: Slyngshede) [09:00:57] (PS4) Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - https://gerrit.wikimedia.org/r/809533 [09:01:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1173 for on-site maintenance T310595', diff saved to https://phabricator.wikimedia.org/P30603 and previous config saved to /var/cache/conftool/dbconfig/20220629-090120-root.json [09:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:27] T310595: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 [09:02:03] (PS1) Marostegui: db1173: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/809540 (https://phabricator.wikimedia.org/T310595) [09:03:02] (CR) Marostegui: [C: +2] db1173: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/809540 (https://phabricator.wikimedia.org/T310595) (owner: Marostegui) [09:03:05] SRE, ops-eqiad, DBA, Patch-For-Review: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (Marostegui) @Jclark-ctr host offline, you can proceed whenever you want. Once you are done, please power it back on and I will take it from there. Thanks a lot! 
[09:03:36] SRE-tools, Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (MoritzMuehlenhoff) >>! In T311593#8036155, @Marostegui wrote: > I have a few more hosts to decommission. I can try to do so, but we'd not know whether it helped... [09:08:53] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36111/console" [puppet] - https://gerrit.wikimedia.org/r/809533 (owner: Slyngshede) [09:09:55] (PS7) Vlad.shapik: WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [09:10:32] (CR) CI reject: [V: -1] WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: Vlad.shapik) [09:14:58] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36113/console" [puppet] - https://gerrit.wikimedia.org/r/809533 (owner: Slyngshede) [09:22:13] (PS5) Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - https://gerrit.wikimedia.org/r/809533 [09:23:43] SRE-tools, Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (Marostegui) Wilco! 
[09:24:54] (03CR) 10CI reject: [V: 04-1] WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - 10https://gerrit.wikimedia.org/r/809533 (owner: 10Slyngshede) [09:27:51] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: separate metric current_client_connections [puppet] - 10https://gerrit.wikimedia.org/r/803285 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [09:29:45] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: rename max_connections_active_in [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [09:33:29] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: remove wmf-tls log format [puppet] - 10https://gerrit.wikimedia.org/r/803301 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [09:34:15] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:21] (03PS1) 10David Caro: wmcs.openstack: move libs to it's own module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/809543 [09:34:56] (03PS6) 10Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - 10https://gerrit.wikimedia.org/r/809533 [09:37:34] (03CR) 10CI reject: [V: 04-1] WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - 10https://gerrit.wikimedia.org/r/809533 (owner: 10Slyngshede) [09:38:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1132 with some weight to get it warmed up', diff saved to https://phabricator.wikimedia.org/P30605 and previous config saved to /var/cache/conftool/dbconfig/20220629-093826-root.json [09:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:01] (03CR) 10CI reject: [V: 04-1] wmcs.openstack: move libs to it's own module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/809543 (owner: 10David Caro) [09:41:21] PROBLEM - Check systemd state on 
netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:23] (03PS7) 10Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - 10https://gerrit.wikimedia.org/r/809533 [09:51:54] (03PS8) 10Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - 10https://gerrit.wikimedia.org/r/809533 [09:53:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [09:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [09:53:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [09:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [09:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:12] (03PS9) 10Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - 10https://gerrit.wikimedia.org/r/809533 [10:00:13] (03PS1) 10Kosta Harlan: Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809549 (https://phabricator.wikimedia.org/T311467) [10:00:26] (03PS1) 10Kosta Harlan: Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809550 (https://phabricator.wikimedia.org/T311467) [10:03:12] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [10:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:35] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:03:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [10:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T309311)', diff saved to https://phabricator.wikimedia.org/P30606 and previous config saved to /var/cache/conftool/dbconfig/20220629-100341-ladsgroup.json [10:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:46] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [10:04:08] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36116/console" [puppet] - 10https://gerrit.wikimedia.org/r/809533 (owner: 10Slyngshede) [10:08:02] (03PS10) 10Slyngshede: WIP: profile::prometheus::ops add ganeti cluster targets [puppet] - 10https://gerrit.wikimedia.org/r/809533 [10:12:56] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36117/console" [puppet] - 10https://gerrit.wikimedia.org/r/809533 (owner: 10Slyngshede) [10:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:16:55] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T309311)', diff saved to https://phabricator.wikimedia.org/P30607 and previous config saved to /var/cache/conftool/dbconfig/20220629-101655-ladsgroup.json [10:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:00] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [10:26:49] (03CR) 10Klausman: [C: 03+1] "Overall, LGTM with two small bits." [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [10:28:42] 10SRE, 10API Platform, 10Traffic, 10VisualEditor, and 2 others: Find out if Varnish is messing with ETags, and what to do about it. - https://phabricator.wikimedia.org/T310904 (10daniel) p:05Triage→03Medium [10:32:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P30608 and previous config saved to /var/cache/conftool/dbconfig/20220629-103200-ladsgroup.json [10:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:33] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:40:43] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:44:42] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Product-Infrastructure-Team-Backlog, 10VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mob... 
- https://phabricator.wikimedia.org/T259812 [10:45:49] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:47:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P30610 and previous config saved to /var/cache/conftool/dbconfig/20220629-104705-ladsgroup.json [10:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:46] (03PS4) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) [10:48:52] (03CR) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [10:49:01] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36118/console" [puppet] - 10https://gerrit.wikimedia.org/r/809533 (owner: 10Slyngshede) [10:49:29] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36119/console" [puppet] - 10https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [10:51:50] (03PS11) 10Slyngshede: profile::prometheus::ops add ganeti cluster targets [puppet] - 10https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) [10:53:43] (03CR) 10Ssingh: "[Commenting to indicate that this needs backward compatibility]" [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [10:53:46] (03CR) 10Ssingh: "[Commenting to indicate that this needs 
backward compatibility]" [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [10:56:18] (03CR) 10Slyngshede: "Not sure if this is the right way to go about adding the Ganeti metrics, but there seemed to be no existing way to do so." [puppet] - 10https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [10:59:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After restart', diff saved to https://phabricator.wikimedia.org/P30612 and previous config saved to /var/cache/conftool/dbconfig/20220629-105859-root.json [10:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:14] (03CR) 10Slyngshede: "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/809179 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:01:17] (03CR) 10Slyngshede: [C: 03+2] snapshot: remove absented dumps-timechecker cron [puppet] - 10https://gerrit.wikimedia.org/r/809179 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:02:01] (03CR) 10Slyngshede: "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/809178 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:02:02] (03CR) 10Slyngshede: [C: 03+2] dumps: remove absented dumps-fetches-wikitech cron [puppet] - 10https://gerrit.wikimedia.org/r/809178 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:02:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T309311)', diff saved to https://phabricator.wikimedia.org/P30613 and previous config saved to /var/cache/conftool/dbconfig/20220629-110210-ladsgroup.json [11:02:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:17] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [11:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After restart', diff saved to https://phabricator.wikimedia.org/P30614 and previous config saved to /var/cache/conftool/dbconfig/20220629-111403-root.json [11:14:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:49] (03CR) 10Muehlenhoff: "The bean error mentioned in the patch description are unrelated, this was ultimately an error caused by misleading CAS documentation (for " [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/809132 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [11:20:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [11:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [11:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T309311)', diff saved to https://phabricator.wikimedia.org/P30615 and previous config saved to /var/cache/conftool/dbconfig/20220629-112054-ladsgroup.json [11:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:00] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [11:26:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudnet1005.mgmt.eqiad.wmnet with reboot policy FORCED [11:26:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudrabbit1001.mgmt.eqiad.wmnet with reboot policy FORCED [11:26:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudnet1006.mgmt.eqiad.wmnet with reboot policy FORCED [11:26:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudservices1005.mgmt.eqiad.wmnet with reboot policy FORCED [11:26:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudrabbit1002.mgmt.eqiad.wmnet with reboot policy FORCED [11:26:00] !log cmjohnson@cumin1001 START - 
Cookbook sre.hosts.provision for host cloudrabbit1003.mgmt.eqiad.wmnet with reboot policy FORCED [11:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:39] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [11:29:02] (03PS12) 10Slyngshede: profile::prometheus::ops add ganeti cluster targets [puppet] - 10https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) [11:29:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After restart', diff saved to https://phabricator.wikimedia.org/P30616 and previous config saved to /var/cache/conftool/dbconfig/20220629-112907-root.json [11:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:56] (03PS2) 10Filippo Giunchedi: prometheus: probe DNS for (www).wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/809536 (https://phabricator.wikimedia.org/T169860) [11:32:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T309311)', diff saved to https://phabricator.wikimedia.org/P30617 and previous config saved to /var/cache/conftool/dbconfig/20220629-113207-ladsgroup.json [11:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:14] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [11:32:59] RECOVERY - Check systemd state on netbox1002 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:55] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36120/console" [puppet] - 10https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [11:35:01] (03CR) 10Muehlenhoff: profile::prometheus::ops add ganeti cluster targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [11:35:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10Cmjohnson) [11:38:30] (03CR) 10Slyngshede: profile::prometheus::ops add ganeti cluster targets (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [11:42:01] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [11:44:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After restart', diff saved to https://phabricator.wikimedia.org/P30618 and previous config saved to /var/cache/conftool/dbconfig/20220629-114411-root.json [11:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P30619 and previous config saved to /var/cache/conftool/dbconfig/20220629-114712-ladsgroup.json [11:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudrabbit1002.mgmt.eqiad.wmnet with reboot policy FORCED [11:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudrabbit1003.mgmt.eqiad.wmnet with reboot policy FORCED [11:48:01] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudrabbit1001.mgmt.eqiad.wmnet with reboot policy FORCED [11:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:03] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudservices1005.mgmt.eqiad.wmnet with reboot policy FORCED [11:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:41] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudnet1006.mgmt.eqiad.wmnet with reboot policy FORCED [11:48:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:14] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudnet1005.mgmt.eqiad.wmnet with reboot policy FORCED [11:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:01] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudnet1006.mgmt.eqiad.wmnet with reboot policy FORCED [11:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:37] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudnet1006.mgmt.eqiad.wmnet with reboot policy FORCED [11:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) [11:54:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) @Jclark-ctr Can you verify the mgmt cable is connected for cloudnet1006. [11:56:59] (03CR) 10Ayounsi: "The DNS check is from before my time. Leaving it to Brandon as it's DNS related and he will know better than me on how best to monitor it." 
[puppet] - 10https://gerrit.wikimedia.org/r/809535 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:02:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P30620 and previous config saved to /var/cache/conftool/dbconfig/20220629-120217-ladsgroup.json [12:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:58] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:33] PROBLEM - Check systemd state on ms-be1029 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:42] (03PS5) 10Filippo Giunchedi: icinga: check commons.w.o with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/804274 (https://phabricator.wikimedia.org/T305847) [12:08:44] (03PS4) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 [12:08:46] (03PS1) 10Filippo Giunchedi: prometheus: adjust check::http params based on distro [puppet] - 10https://gerrit.wikimedia.org/r/809586 (https://phabricator.wikimedia.org/T305847) [12:10:05] (03CR) 10Filippo Giunchedi: "Ideally we run Bullseye everywhere (https://phabricator.wikimedia.org/T309979) in the meantime adjust options accordingly" [puppet] - 10https://gerrit.wikimedia.org/r/809586 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:13:54] (03CR) 10Filippo Giunchedi: "+ traffic folks CC'd (feel free to review/comment!)" [puppet] - 10https://gerrit.wikimedia.org/r/809535 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:15:13] (03CR) 10Filippo Giunchedi: "This is the "deployment" of the probes in https://gerrit.wikimedia.org/r/c/operations/puppet/+/809535 and will be performed from 
all sites" [puppet] - 10https://gerrit.wikimedia.org/r/809536 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:17:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T309311)', diff saved to https://phabricator.wikimedia.org/P30621 and previous config saved to /var/cache/conftool/dbconfig/20220629-121722-ladsgroup.json [12:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:30] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [12:24:48] !log mforns@deploy1002 Started deploy [analytics/refinery@2f5987d]: Regular analytics weekly train [analytics/refinery@2f5987d] [12:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:57] !log mforns@deploy1002 Finished deploy [analytics/refinery@2f5987d]: Regular analytics weekly train [analytics/refinery@2f5987d] (duration: 01m 08s) [12:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:22] (03PS1) 10Jcrespo: Prepare for 0.1.3 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809588 (https://phabricator.wikimedia.org/T311215) [12:26:25] (03PS1) 10Jcrespo: cli: Change logging to log on a different file each [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809589 (https://phabricator.wikimedia.org/T311215) [12:26:43] !log mforns@deploy1002 Started deploy [analytics/refinery@2f5987d] (thin): Regular analytics weekly train THIN [analytics/refinery@2f5987d] [12:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:51] !log mforns@deploy1002 Finished deploy [analytics/refinery@2f5987d] (thin): Regular analytics weekly train THIN [analytics/refinery@2f5987d] (duration: 00m 07s) [12:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:02] !log mforns@deploy1002 Started deploy [analytics/refinery@2f5987d] (hadoop-test): 
Regular analytics weekly train TEST [analytics/refinery@2f5987d] [12:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:13] (03PS2) 10Jcrespo: cli: Change logging to log on a different file each [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809589 (https://phabricator.wikimedia.org/T311215) [12:34:09] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:35] !log mforns@deploy1002 Finished deploy [analytics/refinery@2f5987d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2f5987d] (duration: 07m 32s) [12:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service,refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:54] (03CR) 10Slyngshede: [C: 03+2] profile::prometheus::ops add ganeti cluster targets [puppet] - 10https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [12:41:17] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:26] !log otto@deploy1002 Started deploy [analytics/refinery@2f5987d]: (no justification provided) [12:47:29] !log otto@deploy1002 Finished deploy [analytics/refinery@2f5987d]: (no justification provided) (duration: 00m 03s) [12:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:34] !log otto@deploy1002 Started deploy [analytics/refinery@2f5987d]: (no justification provided) [12:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:39] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:34] !log otto@deploy1002 Finished deploy [analytics/refinery@2f5987d]: (no justification provided) (duration: 02m 00s) [12:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:51] !log otto@deploy1002 Started deploy [analytics/refinery@2f5987d]: (no justification provided) [12:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:29] !log otto@deploy1002 Finished deploy [analytics/refinery@2f5987d]: (no justification provided) (duration: 00m 37s) [12:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:49] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,DELETE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:56:05] RECOVERY - Hadoop HDFS Namenode FSImage Age on an-master1002 is OK: FILE_AGE OK: /srv/hadoop/name/current/VERSION is 109 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [12:56:48] sukhe: I'm here [12:57:20] XioNoX: hello! [12:57:28] waiting for your final review of the patch [12:57:35] and then we can get started [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220629T1300). [13:00:05] MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:17] (03PS1) 10Marostegui: instances.yaml: Remove db2081 [puppet] - 10https://gerrit.wikimedia.org/r/809592 (https://phabricator.wikimedia.org/T311475) [13:00:25] i can deploy today [13:00:27] hi MatmaRex [13:00:31] hi [13:00:47] MatmaRex: would you prefer to test them separately, or at once? [13:00:48] sukhe: re-checking but I think I was fine with it [13:00:58] thanks! [13:01:13] I am preparing the other stuff, please take your time [13:01:28] (03CR) 10Filippo Giunchedi: [C: 03+1] "I forgot one fundamental thing: you'll need to add the corresponding job definition for prometheus to pick up, e.g. like $trafficserver_jo" [puppet] - 10https://gerrit.wikimedia.org/r/809533 (https://phabricator.wikimedia.org/T311288) (owner: 10Slyngshede) [13:01:38] urbanecm: either is fine [13:01:42] okay, thanks [13:01:46] (03PS2) 10Urbanecm: Enable DiscussionTools newtopictool at enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809010 (https://phabricator.wikimedia.org/T311023) (owner: 10Bartosz Dziewoński) [13:01:50] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools newtopictool at enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809010 (https://phabricator.wikimedia.org/T311023) (owner: 10Bartosz Dziewoński) [13:02:00] (03PS3) 10Urbanecm: Enable DiscussionTools on mobile at partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809012 (https://phabricator.wikimedia.org/T298221) (owner: 10Bartosz Dziewoński) [13:02:02] (03CR) 10Ayounsi: [C: 03+1] "ship it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [13:02:04] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools on mobile at partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809012 (https://phabricator.wikimedia.org/T298221) (owner: 10Bartosz Dziewoński) [13:02:05] sukhe: +1 [13:02:10] i'm a little distracted so please give me a few more minutes to test [13:02:14] XioNoX: here's to a third time lucky :P [13:02:27] RECOVERY - Check systemd state on ms-be1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:41] (03PS3) 10Urbanecm: Enable DiscussionTools visualenhancements at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809011 (https://phabricator.wikimedia.org/T310960) (owner: 10Bartosz Dziewoński) [13:02:42] !log sudo cumin -d 'P{R:Class = bird}' 'disable-puppet "PLEASE DO NOT enable Puppet: deploying T310574"' [13:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:48] T310574: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 [13:02:59] (03Merged) 10jenkins-bot: Enable DiscussionTools newtopictool at enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809010 (https://phabricator.wikimedia.org/T311023) (owner: 10Bartosz Dziewoński) [13:03:03] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools visualenhancements at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809011 (https://phabricator.wikimedia.org/T310960) (owner: 10Bartosz Dziewoński) [13:03:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1025.eqiad.wmnet with OS buster [13:03:08] (03Merged) 10jenkins-bot: Enable DiscussionTools on mobile at partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809012 (https://phabricator.wikimedia.org/T298221) (owner: 10Bartosz Dziewoński) [13:03:10] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:11] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:03:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS... [13:03:26] (03PS2) 10Urbanecm: Enable DiscussionTools on mobile at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809223 (https://phabricator.wikimedia.org/T310960) (owner: 10Bartosz Dziewoński) [13:03:29] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools on mobile at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809223 (https://phabricator.wikimedia.org/T310960) (owner: 10Bartosz Dziewoński) [13:04:02] (03Merged) 10jenkins-bot: Enable DiscussionTools visualenhancements at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809011 (https://phabricator.wikimedia.org/T310960) (owner: 10Bartosz Dziewoński) [13:04:28] XioNoX: going with durum1001 [13:04:32] (03Merged) 10jenkins-bot: Enable DiscussionTools on mobile at mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809223 (https://phabricator.wikimedia.org/T310960) (owner: 10Bartosz Dziewoński) [13:04:41] yay! 
[13:04:41] no package updates required on this host [13:04:44] (03CR) 10Ssingh: [V: 03+1 C: 03+2] bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/809227 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [13:04:58] urbanecm: sorry, i'm away for 10 minutes [13:05:09] no problem MatmaRex, I'll pull to mwdebug and wait for you to return [13:05:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:30] MatmaRex: pulled to mwdebug1001, ready for you to test once you're back. [13:06:04] XioNoX: PCC wouldn't show it but [13:06:04] +ARGS="-bird.v2 -format.new=true" [13:06:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:06:06] it picked it up [13:06:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:28] nice [13:06:43] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1025.eqiad.wmnet with OS buster [13:06:46] sukhe: let's see if everything falls into place on its own or if there is anything to kick [13:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS bus... 
[13:07:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:14] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2081 [puppet] - 10https://gerrit.wikimedia.org/r/809592 (https://phabricator.wikimedia.org/T311475) (owner: 10Marostegui) [13:07:22] oops interesting [13:07:32] bird.conf line 8 [13:07:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1027.eqiad.wmnet with OS buster [13:07:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1027.eqiad.wmnet with OS... [13:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2081 from dbctl T311475', diff saved to https://phabricator.wikimedia.org/P30622 and previous config saved to /var/cache/conftool/dbconfig/20220629-130741-marostegui.json [13:07:43] let me confirm if we got this right from yesterday [13:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:47] T311475: Decommission db[2071-2092] - https://phabricator.wikimedia.org/T311475 [13:08:17] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:08:19] weird [13:08:26] not sure why it's complaining! 
it's the same line from yesterday [13:08:42] checking [13:09:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2081.codfw.wmnet [13:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:22] sukhe: protocol device I think [13:09:47] XioNoX: but it matches https://gerrit.wikimedia.org/r/c/operations/puppet/+/809205/2/modules/bird/templates/bird_anycast.conf.erb? [13:10:23] (03PS1) 10Marostegui: mariadb: Remove db2081 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/809593 (https://phabricator.wikimedia.org/T311623) [13:10:35] yeah [13:11:02] sukhe@durum1001:~$ apt-cache policy bird [13:11:02] bird: Installed: (none) [13:11:06] so we are definitely on bird2, that's not the issue [13:11:17] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:11:28] clean puppet run, with the exception of bird2 failing but that's what we are talking about right now [13:11:32] sukhe: yeah I think it's direct, not device, dunno how I made that oversight [13:11:47] (03PS1) 10Slyngshede: profile::prometheus::ops enable Ganeti metric scraping. [puppet] - 10https://gerrit.wikimedia.org/r/809594 [13:12:02] don't worry about it, you are not alone :) [13:12:04] fixing [13:12:09] can you check the routers please for now? [13:12:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:12:11] doing it manually to test [13:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:16] I updated it manually [13:12:28] 10SRE-tools, 10Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (10Marostegui) I ran swapoff -a on db2081 and it went fine. Could be coincidence or it could've been the fix. Hard to know. However, I guess it doesn't hurt to inc... 
[13:12:54] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [13:12:55] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648 [13:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:01] urbanecm: sorry, looking now [13:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:13:05] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [13:13:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:10] no problem. let me know how it goes :) [13:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:13] XioNoX: but wait, we already have direct on line 37 [13:13:18] removing that one [13:13:25] sukhe: wait I'm editing it too [13:13:31] ok waiting [13:13:41] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:13:48] sukhe: alright there we are [13:14:05] urbanecm: 809010 looks good [13:14:14] sukhe: so we need an empty `protocol device` [13:14:27] MatmaRex: ack. I'll sync them at once, so waiting for all of them to be checked. [13:14:33] and then protocol direct with v4/v6 [13:14:40] (03PS1) 10Ssingh: bird: update bird.conf (replace protocol device with direct) [puppet] - 10https://gerrit.wikimedia.org/r/809595 [13:14:49] sukhe: I inverted the two [13:15:15] and the v4 and v6 prefixes are there [13:15:20] XioNoX: patch out, let's review and do another Puppet run to be sure?! 
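A minimal sketch of the bird.conf layout described in the exchange above (an empty `protocol device`, followed by a `protocol direct` that carries both IPv4 and IPv6), written in BIRD 2 syntax. The interface pattern is an assumption made for illustration only; the real template is the one in change 809595 and may well differ.

```
# Sketch only, not the deployed template.
# BIRD 2 handles IPv4 and IPv6 in a single daemon and a single config file.

# Empty device protocol: produces no routes, it only lets BIRD learn
# interface state from the kernel.
protocol device {
}

# Direct protocol: generates routes from the addresses configured on the
# matching interfaces, with one channel per address family.
protocol direct {
    ipv4;
    ipv6;
    interface "lo";   # assumption: the anycast addresses sit on the loopback
}
```

With both channels declared in one `protocol direct` block, the v4 and v6 prefixes come from the same daemon, which fits the "merge IPv4 and IPv6 configurations" description of the earlier patch.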
[13:15:27] https://gerrit.wikimedia.org/r/c/operations/puppet/+/809595 <-- [13:15:34] yep [13:16:12] (03CR) 10Ayounsi: "One comment" [puppet] - 10https://gerrit.wikimedia.org/r/809595 (owner: 10Ssingh) [13:16:16] 809012 seems okay [13:16:28] (03CR) 10Ayounsi: bird: update bird.conf (replace protocol device with direct) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809595 (owner: 10Ssingh) [13:16:35] sukhe: one comment then lgtm [13:16:36] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove db2081 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/809593 (https://phabricator.wikimedia.org/T311623) (owner: 10Marostegui) [13:16:37] aaha [13:16:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:00] and the last mediawikiwiki changes look good too [13:17:03] urbanecm: all look good [13:17:09] XioNoX: this bird better fly now :P [13:17:09] (03PS2) 10Ssingh: bird: update bird.conf (replace protocol device with direct) [puppet] - 10https://gerrit.wikimedia.org/r/809595 [13:17:16] MatmaRex: okay, thanks. syncing! 
[13:17:31] sukhe: everything flies with enough thrust [13:17:46] that sounds about right, given the current circumstances :D [13:18:27] ok I am merging this then [13:18:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2081.codfw.wmnet [13:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:42] (03CR) 10Ayounsi: [C: 03+1] bird: update bird.conf (replace protocol device with direct) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809595 (owner: 10Ssingh) [13:18:46] sukhe: +1 [13:18:49] (03CR) 10Ssingh: [C: 03+2] bird: update bird.conf (replace protocol device with direct) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809595 (owner: 10Ssingh) [13:18:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1027.eqiad.wmnet with reason: host reimage [13:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:25] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2081 - https://phabricator.wikimedia.org/T311623 (10Marostegui) This is ready for on-site steps! [13:20:07] XioNoX: looks good! [13:20:11] can you confirm the router side? 
[13:20:15] checking [13:20:26] yep, all good! [13:20:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1026.eqiad.wmnet with OS buster [13:20:34] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36121/console" [puppet] - 10https://gerrit.wikimedia.org/r/809594 (owner: 10Slyngshede) [13:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1026.eqiad.wmnet with OS... [13:21:18] sukhe: the prometheus exporter is working fine too [13:21:22] nice! [13:22:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1027.eqiad.wmnet with reason: host reimage [13:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:34] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 78fe6a15: 9f76648: 897e69c7: 977e57b: DiscussionTools config changes (T310960, T298221, T311023) (duration: 03m 38s) [13:23:40] sukhe: what's the next host? [13:23:46] MatmaRex: and all live. anything else i can do for you today? [13:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:49] T311023: Enable new topic tool by default on enwiki - https://phabricator.wikimedia.org/T311023 [13:23:49] T298221: [Config Change] Offer mobile Reply and New Discussion Tools at partner wikis - https://phabricator.wikimedia.org/T298221 [13:23:49] XioNoX: so all looks good? 
[13:23:49] T310960: [Config Change] Make all DiscussionTools available by default at mediawiki.org - https://phabricator.wikimedia.org/T310960 [13:23:55] we can do durum1002, to be extra sure [13:23:56] urbanecm: thank you! [13:23:58] then roll out to all durums [13:24:02] and then A:wikidough [13:24:04] no problem :) [13:24:14] these should go fairly quickly, I will use debdeploy [13:24:33] sukhe: yeah everything is good, routers and icinga happy [13:24:42] sukhe: +1 [13:24:47] yayay [13:24:52] ok let's try durum1002 [13:26:00] (03PS1) 10Muehlenhoff: Disable swap before running wipefs [cookbooks] - 10https://gerrit.wikimedia.org/r/809599 (https://phabricator.wikimedia.org/T311593) [13:26:09] go for it [13:26:48] https://puppetboard.wikimedia.org/report/durum1002.eqiad.wmnet/1e625a6a37f0a350b28416bfdcd8773e349ab8e4 [13:26:51] looks good [13:26:55] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Ladsgroup) https://people.wikimedia.org/~ladsgroup/mariadb_flamegraphs/normal.106.svg and https://people.wikimedia.org/~ladsgroup/m... [13:27:06] can you verify the routers for this too? [13:27:13] just to be extra safe [13:28:35] sukhe: checked and all good, bgp, bfd [13:28:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Cmjohnson) @jclark cloudcephosd1025 states no cable, can you verify the cable and/or the port please [13:28:42] nice! 
[13:28:43] ok then [13:28:49] I think we can do a A:durum deploy and move on [13:29:04] sukhe: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=durum1002&service=DPKG [13:29:08] probably need a re-check [13:29:26] checking [13:29:54] RECOVERY - DPKG on durum1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:29:55] (scheduled a re-check) [13:29:57] yep [13:29:58] (03PS2) 10Filippo Giunchedi: profile::prometheus::ops enable Ganeti metric scraping. [puppet] - 10https://gerrit.wikimedia.org/r/809594 (owner: 10Slyngshede) [13:30:00] interesting [13:30:00] sukhe: ^ [13:30:04] though I wonder why that happened at all [13:30:15] let me check the puppet run once more [13:31:03] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) ^ Those two graphs are between db1132 (with performance_schema disabled) and a normal 10.4 host. They look _a lot_ more... [13:31:17] hm ok [13:31:20] nothing that I can find [13:31:33] let's go ahead with the rest of the durum hosts for now :) [13:31:39] +! [13:31:40] 1 [13:31:50] running apt update on A:durum followed by debdeploy [13:32:08] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1026.eqiad.wmnet with reason: host reimage [13:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:43] 10SRE-swift-storage, 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (10klausman) [13:33:10] (03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/809594 (owner: 10Slyngshede) [13:34:01] ok running puppet agent [13:35:00] * sukhe waits [13:35:21] sukhe: is apt update needed or just to be safe? 
[13:35:28] yep, did that [13:35:31] and then debdeployed [13:35:35] and now running puppet agent [13:35:37] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1026.eqiad.wmnet with reason: host reimage [13:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:50] sukhe: but I'm wondering if it's strictly needed or not [13:35:57] I'm not familiar with the deb side of things [13:36:23] XioNoX: I don't think it was required as debmonitor was already showing the upgrade, but https://wikitech-static.wikimedia.org/wiki/Software_deployment [13:36:28] > Note that debdeploy won't run apt update, so if you uploaded the new version very recently, it won't make any changes [13:36:49] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:36:53] oh oh [13:37:01] esams [13:37:03] checking [13:37:34] bird looks OK though [13:37:45] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:37:51] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:37:53] v4 is good, v6 not [13:38:05] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:38:07] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:38:23] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:38:24] hmm bird6 (!) 
[13:38:57] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:39:29] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:39:34] ha Puppet failed [13:39:59] sukhe: bird6 is down but bird is still on 1.6 [13:40:11] looking at durum3001 [13:40:29] should I try a force puppet run or you're on it? [13:40:32] doing it [13:40:50] let's see if it fixes it [13:41:14] Package[bird2] failure purged present [13:41:29] it can't install bird2 [13:42:08] maybe because of The following packages will be DOWNGRADED: prometheus-bird-exporter [13:42:09] ? [13:42:28] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (10Papaul) [13:42:38] yeah E: Packages were downgraded and -y was used without --allow-downgrades. [13:42:40] I think for some reason the debdeploy for prometheus-didn't run [13:42:46] so that's probably it then [13:42:54] checking [13:43:00] ok [13:43:07] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:43:09] we should see recoveries on esams [13:43:10] ok great [13:43:13] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2075 - https://phabricator.wikimedia.org/T311591 (10Papaul) [13:43:23] it's great that v4 stayed up through it [13:43:33] :D [13:43:58] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2081 - https://phabricator.wikimedia.org/T311623 (10Papaul) [13:44:34] confirmed that everything is working now in esams [13:45:29] thanks, checking ulsfo quickly [13:45:34] to make sure changes propagated [13:45:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:04] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [13:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:27] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:47:31] alright [13:47:41] sukhe: did it recover on its own or you did something? [13:47:52] XioNoX: just ran install via cumin and puppet agent again :) [13:47:55] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:48:21] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:48:21] it seems like I ignored the version number in prometheus-bird-exporter and hence it is a downgrade [13:48:29] but that's OK, we can fix it later quickly [13:48:54] for now, everything should be fine with durum [13:49:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1028.eqiad.wmnet with OS buster [13:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:06] Wikidough is next and should go a bit more smoothly [13:49:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1028.eqiad.wmnet with OS... 
[13:49:17] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:49:17] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 95, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:49:20] can you confirm the routers for me please? [13:49:35] durums look good [13:50:01] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:50:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1029.eqiad.wmnet with OS buster [13:50:12] sukhe: checking drmrs as a random one, I trust icinga for the ithers [13:50:14] others [13:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1029.eqiad.wmnet with OS... 
[13:50:28] Icinga looks ok for durum*; puppet runs which should clear up [13:50:40] (already cleared on puppetboard) [13:50:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:50:57] (03PS1) 10Btullis: Update the partman recipe for use with the new stat servers [puppet] - 10https://gerrit.wikimedia.org/r/809602 (https://phabricator.wikimedia.org/T307399) [13:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:01] sukhe: drmrs lgtm [13:51:12] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (10Papaul) [13:51:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1030.eqiad.wmnet with OS buster [13:51:30] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2075 - https://phabricator.wikimedia.org/T311591 (10Papaul) [13:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1030.eqiad.wmnet with OS... [13:51:53] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2081 - https://phabricator.wikimedia.org/T311623 (10Papaul) [13:52:00] ok [13:52:04] A:wikidough is next [13:52:13] this time it should go smoothly to inspire confidence for internal recursors :P [13:52:31] XioNoX: ready to proceed? [13:52:35] this will be a fun one [13:52:45] sukhe: yep [13:52:53] deep breaths! 
[13:52:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1031.eqiad.wmnet with OS buster [13:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS... [13:53:32] sukhe: which host first? [13:54:07] https://debmonitor.wikimedia.org/search?q=doh [13:54:08] looks clean [13:54:12] packages upgraded [13:54:22] doh1001 [13:54:26] starting now with puppet run [13:54:37] (03PS3) 10Jcrespo: cli: Change logging to log on a different file each [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809589 (https://phabricator.wikimedia.org/T311215) [13:54:41] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809602 (https://phabricator.wikimedia.org/T307399) (owner: 10Btullis) [13:54:57] alright [13:54:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1032.eqiad.wmnet with OS buster [13:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1032.eqiad.wmnet with OS... 
[13:55:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1033.eqiad.wmnet with OS buster [13:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1033.eqiad.wmnet with OS... [13:55:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1034.eqiad.wmnet with OS buster [13:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1034.eqiad.wmnet with OS... [13:56:07] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:56:17] (03PS1) 10Marostegui: wmnet: Update x1-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/809605 (https://phabricator.wikimedia.org/T300472) [13:56:21] sukhe: v4 and v6 are down [13:56:25] up now [13:56:30] probably during the restart? [13:56:58] looks ok otherwise [13:57:00] sukhe: receiving 1 v4 and 1 v6 prefix [13:57:01] (03CR) 10Muehlenhoff: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/809602 (https://phabricator.wikimedia.org/T307399) (owner: 10Btullis) [13:57:02] yep [13:57:03] lgtm [13:57:03] nice [13:57:04] ! [13:57:13] clean puppet run too [13:57:28] sukhe: I have a meeting in 3min, but will keep an eye in here [13:57:32] thanks [13:57:33] so [13:57:36] Wikidough then? 
[13:57:41] internal recursors or centrallog? [13:57:42] and do the verification [13:57:50] both internal recursors and centrallog have different configs [13:57:52] namely, just V4 [13:58:04] sukhe: wikidough, then centrallog [13:58:08] cool [13:58:08] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1027.eqiad.wmnet with OS buster [13:58:11] then dns [13:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Cmjohnson) @Jclark-ctr cloudcephosd1031 same thing, no cable, can you check this as well [13:58:15] ok awesome [13:58:18] good luck to us! [13:58:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1027.eqiad.wmnet with OS bus... [13:58:20] doing wikidough now [13:59:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) The RAID controller firmware update did not make any difference, but thankfully @fgiunchedi has identified the cause of the reversed device names. Essentia... 
[13:59:57] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/809605 (https://phabricator.wikimedia.org/T300472) (owner: 10Marostegui) [14:00:15] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1031.eqiad.wmnet with OS buster [14:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS bus... [14:00:36] (03PS1) 10Marostegui: mariadb: Promote db1120 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/809607 (https://phabricator.wikimedia.org/T300472) [14:00:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1028.eqiad.wmnet with reason: host reimage [14:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:46] (03CR) 10Btullis: [C: 03+2] Update the partman recipe for use with the new stat servers [puppet] - 10https://gerrit.wikimedia.org/r/809602 (https://phabricator.wikimedia.org/T307399) (owner: 10Btullis) [14:01:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1029.eqiad.wmnet with reason: host reimage [14:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:40] !log sudo cumin -b 1 -s 5 'A:wikidough' 'run-puppet-agent -q' [14:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:47] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/809607 (https://phabricator.wikimedia.org/T300472) (owner: 10Marostegui) [14:02:09] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - 
AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:02:20] ^ should recover [14:02:27] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Update x1-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/809605 (https://phabricator.wikimedia.org/T300472) (owner: 10Marostegui) [14:02:52] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1030.eqiad.wmnet with reason: host reimage [14:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:24] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephosd1030.eqiad.wmnet with reason: host reimage [14:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1028.eqiad.wmnet with reason: host reimage [14:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:34] sukhe: it's all established on the device [14:04:38] so yeah should recover [14:04:40] yep [14:04:42] going smooth so far [14:04:46] Ok to proceed on 12 hosts? Enter the number of affected hosts to confirm or "q" to quit 12 [14:04:50] PASS |█████████████████████████████ | 33% (4/12) [05:20<11:37, 87.13s/hosts] [14:04:58] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host stat1010.eqiad.wmnet with OS bullseye [14:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye executed with errors: - stat101... 
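A minimal sketch of the rollout pattern behind the logged `cumin -b 1 -s 5 'A:wikidough' 'run-puppet-agent -q'` command: hosts are processed in small batches with a pause between them, so a bad change hits only a few hosts before someone can stop it. This is a simplified model, not cumin's implementation. The function name `run_in_batches`, the stop-on-first-failure behavior, and the host names in the demo are assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, List, Tuple


def run_in_batches(
    hosts: Iterable[str],
    batch_size: int,
    batch_sleep: float,
    action: Callable[[str], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, bool]]:
    """Run `action` on hosts in fixed-size batches, pausing between batches.

    Stops at the first host whose action returns False (an assumption of
    this sketch), so later batches are left untouched after a failure.
    """
    host_list = list(hosts)
    results: List[Tuple[str, bool]] = []
    for start in range(0, len(host_list), batch_size):
        for host in host_list[start:start + batch_size]:
            ok = action(host)
            results.append((host, ok))
            if not ok:
                return results
        # Pause only if another batch follows.
        if start + batch_size < len(host_list):
            sleep(batch_sleep)
    return results


if __name__ == "__main__":
    # Hypothetical host names; the no-op sleep keeps the demo instant.
    demo_hosts = [f"example-host-{n:02d}" for n in range(1, 5)]
    for host, ok in run_in_batches(demo_hosts, 1, 5, lambda h: True, sleep=lambda s: None):
        print(host, "PASS" if ok else "FAIL")
```

With a batch size of 1 over 12 hosts there are 11 pauses, which is why a slow per-host action dominates the total run time far more than the sleep interval does.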
[14:05:06] and receiving the prefixes there too [14:05:32] nice [14:06:19] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1032.eqiad.wmnet with reason: host reimage [14:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:26] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Enable webauthn in CAS to replace U2F - https://phabricator.wikimedia.org/T311236 (10MoritzMuehlenhoff) >>! In T311236#8023441, @MoritzMuehlenhoff wrote: >>>! In T311236#8023420, @jbond wrote: >>>but that bails out with a bean error related to the fasterxml par... [14:06:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1033.eqiad.wmnet with reason: host reimage [14:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1034.eqiad.wmnet with reason: host reimage [14:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:51] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1029.eqiad.wmnet with reason: host reimage [14:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) I'm updating the BIOS as well from version 2.13.3 to version 2.14.2 since it was marked as urgent by Dell. 
{F35286599,width=80%} [14:07:31] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Promote db1120 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/809607 (https://phabricator.wikimedia.org/T300472) (owner: 10Marostegui) [14:09:48] sukhe: it won't show a recovery here as it's curently in warning for a different thing [14:09:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1032.eqiad.wmnet with reason: host reimage [14:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:00] sukhe: see the warnings in https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=bgp [14:10:23] nice timing :) [14:11:33] XioNoX: I just confirmed the version issue with prometheus-bird-exporter with moritzm [14:11:35] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1033.eqiad.wmnet with reason: host reimage [14:11:43] we will fix it after, but yeah, it doesn't affect any functionality [14:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:12] (03PS1) 10Hokwelum: make script send mail whenever there is output [puppet] - 10https://gerrit.wikimedia.org/r/809611 [14:12:45] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host stat1010.eqiad.wmnet with OS bullseye [14:12:46] (03CR) 10CI reject: [V: 04-1] make script send mail whenever there is output [puppet] - 10https://gerrit.wikimedia.org/r/809611 (owner: 10Hokwelum) [14:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye [14:13:33] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - 
Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:46] ^ expected [14:13:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) [14:13:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1026.eqiad.wmnet with OS buster [14:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1026.eqiad.wmnet with OS bus... [14:14:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1034.eqiad.wmnet with reason: host reimage [14:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:25] (03PS2) 10Hokwelum: make script send mail whenever there is output [puppet] - 10https://gerrit.wikimedia.org/r/809611 [14:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:18:18] XioNoX: A:wikidough done, no issues at my end [14:18:25] I will wait for confirmation from you before moving to centrallog [14:18:45] > tests/test_dns.py::test_show_information Your resolver is doh1001 [14:19:38] sukhe: nice! 
yeah let's move to centrallog [14:20:17] oh just two hosts here [14:20:22] in which case I will do one by bone [14:21:43] (03CR) 10ArielGlenn: [C: 03+2] make script send mail whenever there is output [puppet] - 10https://gerrit.wikimedia.org/r/809611 (owner: 10Hokwelum) [14:22:55] sukhe: yeah, and I'm fully back [14:24:06] clean puppet run on centrallog1001 [14:24:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1028.eqiad.wmnet with OS buster [14:24:13] this was the first host with just the IPv4 config [14:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1028.eqiad.wmnet with OS bus... [14:24:36] no bird6 on this host anyway so yeah [14:25:00] (03PS1) 10Majavah: Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq"" [puppet] - 10https://gerrit.wikimedia.org/r/809557 [14:25:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1030.eqiad.wmnet with OS buster [14:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1030.eqiad.wmnet with OS bus... [14:26:04] sukhe: easy :) [14:26:07] urbanecm: did my config changes get reverted, or not deployed? i do not see them any more [14:26:33] XioNoX: looking good? 
[14:26:33] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq"" [puppet] - 10https://gerrit.wikimedia.org/r/809557 (owner: 10Majavah) [14:26:38] I have no tests for centrallog so hard to say :) [14:26:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10cmooney) Are we proceeding with the deploy of cloudnet1005/cloudnet1006 prior to T310546 and T310547 being completed? As discussed we'd be relat... [14:26:47] sukhe: yep, prefix is being advertised [14:26:51] nice [14:26:52] ok [14:26:57] so centrallog2002 is a special case then [14:26:59] it's on bullseye [14:27:14] we need to build anycast-hc for bullseye then [14:27:41] prometheus-bird-exporter is good [14:27:41] Depends: bird | bird2, libc6 (>= 2.4) [14:27:55] we can skip this one for now then, or I can build and import the package quickly [14:27:56] ohhh [14:27:58] yep :) [14:28:12] yeah bird2 and the exporter are in upstream but not anycast-hc [14:28:18] yep [14:28:38] sukhe: up to you, are any of the dns hosts on bullseye? 
[14:29:03] drmrs is still buster [14:29:20] so I guess it's the only one [14:29:29] sukhe: +1 to skip it for now and test dns [14:29:38] XioNoX: yep, this is the only one [14:29:40] ok [14:29:47] dunno how much time/effort it is to build it [14:29:56] not much [14:30:01] so I will get it done today *after* this [14:30:07] since we should not keep puppet disabled for too long [14:30:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1034.eqiad.wmnet with OS buster [14:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1034.eqiad.wmnet with OS bus... [14:30:31] sukhe: yep [14:30:36] next is A:dns-rec :) [14:30:46] can anyone help me investigate a problem with the backports/configs from last window? i can see their effect when using any mwdebug server, but not when accessing the sites normally [14:30:59] sukhe: should be as smooth as centrallog1002 [14:31:07] :D [14:31:15] ok I am going to proceed with one to start with [14:31:21] any favourites? if not, dns1001 it is [14:31:28] I am a bit biased towards eqiad, what can I say [14:31:31] sukhe: 1001 is good [14:31:34] ok! [14:31:44] especially as it's the one that will get the most care [14:32:00] e.g. 
when visiting https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&action=edit&section=new , you should get a different form [14:32:47] (03CR) 10Muehlenhoff: [C: 03+2] Inline ganeti::kvm [puppet] - 10https://gerrit.wikimedia.org/r/809157 (owner: 10Muehlenhoff) [14:33:17] the changes are still there in git, so i think they were not deployed properly [14:33:22] XioNoX: running agent on dns1001 [14:34:10] alright [14:34:14] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:23] any deployers around? [14:34:26] ^ expected, hopefully [14:34:54] XioNoX: done [14:34:59] sukhe: confirmed1 [14:35:00] ! [14:35:08] ! [14:35:13] both prefixes are being received [14:35:16] phew [14:35:27] MatmaRex: I’m here but didn’t pay attention during the window [14:35:36] XioNoX: I will do one more manually :) [14:35:38] then we can run cumin [14:35:39] MatMaRex: I'm around [14:35:45] dns1002 [14:36:03] sukhe: sounds good! [14:36:03] Lucas_WMDE: dancy: thanks. it seems to me that my patches are deployed on all mwdebug servers, but not on normal servers [14:36:15] e.g. try visiting https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&action=edit&section=new [14:36:20] (while logged out) [14:36:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1029.eqiad.wmnet with OS buster [14:36:28] you'll get a different interface depending on mwdebug or not [14:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1029.eqiad.wmnet with OS bus... 
[14:36:38] sukhe: even if they break everywhere else, having 2 working servers is enough to prevent a meltdown [14:36:53] MatMaRex: Can you point me to the commit in question? [14:36:55] oh yeah, the magic of anycast :) [14:37:19] dancy: the last 4 commits. e.g. https://gerrit.wikimedia.org/r/c/809010/ [14:37:33] MatmaRex: I seem to get the same interface with both [14:37:42] is “Welcome to the village pump for technical issues” with the large info icon the new interface? [14:38:01] Lucas_WMDE: if you're logged in, you might have some preferences enabled that affect it, try logged out [14:38:04] XioNoX: done [14:38:06] dns1002 [14:38:07] I’m in a private window [14:38:24] and got that interface when I first loaded the page, before touching the WikimediaDebug extension [14:38:38] old: https://i.imgur.com/HzO89iI.png new: https://i.imgur.com/kWNnG5K.png [14:38:54] sukhe: confirmed working! [14:38:54] then I get new without WikimediaDebug [14:39:02] XioNoX: :D [14:39:04] ok then [14:39:12] there are people reporting it at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#Doing_this too [14:39:12] running on A:dns-rec, batch of 1, 5 seconds apart [14:39:17] and I think *afterwards* I briefly got old, don’t remember if with or without WikimediaDebug, before reloading [14:39:23] so to me it feels like it might be flaky [14:39:23] sukhe: godspeed [14:39:25] MatmaRex: i definitely didn't revert them [14:39:26] i guess might depend on the location? or not synced to all servers? [14:39:38] But yesterday a sync only affected half of fleet [14:39:42] So perhaps it happened again [14:39:47] or it's a caching thing but it shouldn't affect this [14:40:03] I just got the old interface from mw1413 [14:40:16] I shutdown my laptop already, Lucas_WMDE if you can just resync IS.php, that'd be helpful [14:40:27] Lemme look at the initializesettings.php file on mw1413 first. 
[14:40:32] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, 10Service-deployment-requests: Setup Initial Image Suggestion Service CI and k8s params/stubs - https://phabricator.wikimedia.org/T305154 (10akosiaris) I think this is done? [14:40:32] Okay [14:40:34] now I got the new interface from mw1391 [14:40:37] dancy: ack [14:40:48] dancy: note mw1413 had a very similar issue yesterday with my deploy [14:40:50] otherwise yeah resync sounds sensible [14:40:51] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1033.eqiad.wmnet with OS buster [14:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1033.eqiad.wmnet with OS bus... [14:41:03] urbanecm: I saw that. Very odd. [14:41:18] The main thing that has changed recently is disabling opcache revalidation and enabling php-rpm restarts. [14:41:22] urbanecm: do you remember if scap printed any warnings? (the SAL entry looks normal, at least) [14:41:26] (unconditional restarts, that is) [14:41:28] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, 10Patch-For-Review: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (10akosiaris) This is lacking just the entrypoint.sh item, right? 
[14:41:36] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, 10Patch-For-Review: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (10akosiaris) [14:41:40] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops: Setup Initial Image Suggestion Service CI and k8s params/stubs - https://phabricator.wikimedia.org/T305154 (10akosiaris) [14:42:23] (03PS1) 10Muehlenhoff: ganeti: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809616 (https://phabricator.wikimedia.org/T308013) [14:43:34] mw1413 has the right wmf-config/InitialiseSettings.php file [14:43:47] looks like that to me too [14:43:54] (same sha256sum as mw1391) [14:44:02] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:44:15] is it worth restarting php-fpm again? [14:44:39] I'll run sync-wikiversions [14:45:09] sukhe: all smooth? [14:45:12] In progress. [14:45:19] (03PS1) 10Majavah: Revert "Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq""" [puppet] - 10https://gerrit.wikimedia.org/r/809560 [14:45:25] XioNoX: so far yep, running agent on A:dns-rec :) [14:45:42] batches of 2, 2 seconds apart. 
I thought 5 seconds was a bit too much [14:45:51] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:46:19] dns2001 is done, you can check that [14:46:59] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq""" [puppet] - 10https://gerrit.wikimedia.org/r/809560 (owner: 10Majavah) [14:48:06] alright [14:48:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1032.eqiad.wmnet with OS buster [14:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1032.eqiad.wmnet with OS bus... [14:49:08] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: Debugging [14:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:18] sukhe: all good for 2001 [14:49:19] 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10MoritzMuehlenhoff) >>! In T67270#8016999, @jbond wrote: >> In such cases it might make sense to align such files by relicensing to Apache 2 > sta... [14:49:25] OK php-rpm restarts finished [14:49:33] MatMaRex: Any changes? [14:49:43] XioNoX: 6/12 :) [14:49:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] Setup .gitconfig for mwpresync system user [puppet] - 10https://gerrit.wikimedia.org/r/809297 (https://phabricator.wikimedia.org/T303857) (owner: 10Ahmon Dancy) [14:49:47] oh, I just saw things change when I reloaded the page. 
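The check described above, comparing the sha256sum of wmf-config/InitialiseSettings.php on mw1413 against mw1391, can be sketched roughly as follows. This is only an illustration that assumes the copies have already been fetched to local paths; the helper names `sha256_of` and `differing_copies` and the file paths in the demo are hypothetical, and how the files are collected from each host is left out.

```python
import hashlib
import os
import tempfile
from typing import Iterable, List


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_copies(reference: str, candidates: Iterable[str]) -> List[str]:
    """List the candidate paths whose content differs from the reference."""
    expected = sha256_of(reference)
    return [path for path in candidates if sha256_of(path) != expected]


if __name__ == "__main__":
    # Hypothetical local copies standing in for files pulled from two hosts.
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "host-a.php")
        stale = os.path.join(tmp, "host-b.php")
        with open(good, "w") as fh:
            fh.write("<?php // new config\n")
        with open(stale, "w") as fh:
            fh.write("<?php // old config\n")
        print(differing_copies(good, [good, stale]))
```

A matching digest only rules out a failed file sync. As the discussion here shows, hosts can hold identical files and still serve stale behavior if cached configuration has not been refreshed.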
[14:50:24] dancy: thanks, looks fixed to me [14:50:33] I also seem to be getting the new interface consistently now [14:50:37] so that leaves us with... "what the hell" [14:50:38] XioNoX: after this, that leaves us with authdns1001 and authdns2001 [14:50:40] thanks dancy! [14:51:07] sukhe: those two should be the exact same as dnsXXX [14:51:10] just different name [14:51:10] yep [14:51:27] should I start them in parallel? I don't see any issues [14:51:46] so what happened here? the files were synced everywhere, but the web service or something didn't "refresh" them? [14:52:10] sukhe: sure [14:52:36] That's what it seems like. There is caching involved w/ the data in InitialiseSettings.php. [14:52:58] and someone was telling me about a possibility of bad caching interactions. I'll get more info. [14:53:19] huh. thanks [14:54:06] XioNoX: now running agent on authdns1001 [14:54:43] alright [14:55:49] dns-rec all done [14:55:58] authdns1001 too [14:56:48] awesome [14:57:18] confirmed all good [14:57:30] ! :D [14:57:37] ok let's do the final one [14:57:49] and for centrallog, I will do it later in the day [14:57:52] I don't think you have to be around for that [14:57:58] doing authdns2001 [14:59:22] (03PS1) 10Clare Ming: Add jawiki, zhwikinews to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809620 (https://phabricator.wikimedia.org/T311419) [15:00:31] XioNoX: all done :) [15:00:39] sukhe: woot [15:01:17] phew! [15:01:47] that was something. 
I guess "third time is the charm" really is something [15:01:51] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (10Papaul) [15:02:08] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2071 - https://phabricator.wikimedia.org/T311589 (10Papaul) 05Open→03Resolved Complete [15:02:14] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2075 - https://phabricator.wikimedia.org/T311591 (10Papaul) [15:02:16] XioNoX: thanks for the help and patience! I am around to monitor things and I will push centrallog as well shortly [15:02:27] and then we can fix the prometheus-bird thing, but that's not urgent in any form [15:02:35] sukhe: no, thank you! [15:02:38] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2081 - https://phabricator.wikimedia.org/T311623 (10Papaul) [15:02:43] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2075 - https://phabricator.wikimedia.org/T311591 (10Papaul) 05Open→03Resolved complete [15:02:50] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2081 - https://phabricator.wikimedia.org/T311623 (10Papaul) 05Open→03Resolved complete [15:07:52] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:23:28] (03PS1) 10Muehlenhoff: calico: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809624 (https://phabricator.wikimedia.org/T308013) [15:23:30] (03PS1) 10Muehlenhoff: ores: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809625 (https://phabricator.wikimedia.org/T308013) [15:23:32] (03PS1) 10Muehlenhoff: rancid: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809626 (https://phabricator.wikimedia.org/T308013) [15:25:47] !log upload anycast-healthchecker 0.8.2-1wm1 to apt.wm.o (bullseye) - T310574 [15:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:53] 
T310574: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 [15:26:26] (03PS1) 10Muehlenhoff: jupyterhub: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809628 (https://phabricator.wikimedia.org/T308013) [15:31:25] (03CR) 10Jdlrobson: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809620 (https://phabricator.wikimedia.org/T311419) (owner: 10Clare Ming) [15:31:35] (03PS1) 10Muehlenhoff: Add Paul Norman to contributors [puppet] - 10https://gerrit.wikimedia.org/r/809629 (https://phabricator.wikimedia.org/T308013) [15:33:32] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:43:53] (03PS3) 10Lucas Werkmeister (WMDE): Increase weights on the language selector statement boosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808941 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [15:44:01] since things are looking quiet at the moment, I’ll deploy ^ [15:44:16] shouldn’t have any effect yet, just one less thing to deploy later :) [15:44:17] Lucas_WMDE: +1 :) [15:44:46] (03CR) 10Hashar: [C: 03+1] Increase weights on the language selector statement boosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808941 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [15:45:54] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Increase weights on the language selector statement boosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808941 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [15:46:19] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [15:46:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Engineering-Kanban: Q4: rack/setup/install stat1010 - 
https://phabricator.wikimedia.org/T307399 (10RobH) >>! In T307399#8037215, @BTullis wrote: > The RAID controller firmware update did not make any difference, but thankfully @fgiunchedi has ident... [15:47:19] (03Merged) 10jenkins-bot: Increase weights on the language selector statement boosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808941 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [15:48:22] syncing [15:48:38] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10Eevans) >>! In T310980#8036017, @MoritzMuehlenhoff wrote: > Can't we just import the Cassandra 4 debs and use those? The work needs to happen at some point anyway and it's a fresh cluster. Buster... [15:48:41] (briefly tested on mwdebug1001 that searchEntities.php didn’t crash) [15:50:19] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) 05Open→03Resolved Indeed, I see no more errors since Arelion investigated earlier this week and since then the errors have cleared up. This is great that its resolved, but not so great in that no... 
[15:51:40] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:808941|Increase weights on the language selector statement boosts (T307869)]] (expected to be a no-op) (duration: 03m 21s) [15:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:47] T307869: Request for new search profile for Wikidata that boosts Items for languages - https://phabricator.wikimedia.org/T307869 [15:53:09] * Lucas_WMDE done [15:53:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:54:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:59] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10elukey) I'd be willing to work on it, but my fear is that it becomes a projects in itself that takes a long time to finish (without proper planning and resource allocation). If everybody agrees... [15:55:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:03] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10BTullis) I have been doing some testing of an install of a server with an H750 card under ticket T307399 recently. One thing that I have ascertained, with the help of @fgiunchedi, is that the swapping o... 
[15:58:55] (03CR) 10Ottomata: [C: 03+1] jupyterhub: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809628 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [16:00:37] (03PS1) 10Majavah: Revert "Revert "openstack::nova: enable TLS encryption for rabbitmq"" [puppet] - 10https://gerrit.wikimedia.org/r/809633 [16:01:28] (03CR) 10Herron: [C: 03+1] "LGTM, please see optional nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [16:07:54] (03PS1) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) [16:10:14] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:28] (03PS2) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) [16:15:28] (03CR) 10CI reject: [V: 04-1] [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: 10Eigyan) [16:15:59] (03PS3) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) [16:22:08] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648 [16:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:15] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [16:22:54] (03PS3) 10Cwhite: loki: add loki as an optional grafana component [puppet] - 10https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) [16:23:25] (03CR) 10Cwhite: loki: add loki as an optional grafana component (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [16:24:13] (03PS4) 10Cwhite: loki: add loki as an optional grafana component [puppet] - 10https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) [16:25:11] (03PS5) 10Cwhite: loki: add loki as an optional grafana component [puppet] - 10https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) [16:27:07] please note that Puppet is intentionally disabled on centrallog2002 (as also in the disable message) [16:27:23] I will enable it in the afternoon, but please don't do it before that as it will break anycast. thank you :) [16:28:40] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:33:03] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/809637 [16:33:10] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298560)', diff saved to https://phabricator.wikimedia.org/P30624 and previous config saved to /var/cache/conftool/dbconfig/20220629-163612-ladsgroup.json [16:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:20] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [16:42:42] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:58] (03PS1) 10Eevans: Assign new password to Cassandra superuser [labs/private] - 10https://gerrit.wikimedia.org/r/809639 (https://phabricator.wikimedia.org/T311652) [16:43:26] (03PS2) 10Eevans: Assign new password to Cassandra superuser [labs/private] - 10https://gerrit.wikimedia.org/r/809639 (https://phabricator.wikimedia.org/T311652) [16:48:12] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [16:51:03] (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/809637 (owner: 10PipelineBot) [16:51:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P30625 and previous config saved to 
/var/cache/conftool/dbconfig/20220629-165117-ladsgroup.json [16:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Engineering-Kanban: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10BTullis) I tried the `partman/custom/kafka-jumbo.cfg` partman recipe on this host, but it didn't seem to be applied. When I checked the log I saw thi... [16:51:41] (03PS1) 10Btullis: Reduce the minimum size of /srv in the kafka-jumbo recipe [puppet] - 10https://gerrit.wikimedia.org/r/809640 (https://phabricator.wikimedia.org/T307399) [16:54:04] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/809637 (owner: 10PipelineBot) [16:56:14] (03PS1) 10RobH: testing h750 recipes [puppet] - 10https://gerrit.wikimedia.org/r/809641 (https://phabricator.wikimedia.org/T297913) [16:56:45] (03CR) 10Btullis: [C: 03+2] Reduce the minimum size of /srv in the kafka-jumbo recipe [puppet] - 10https://gerrit.wikimedia.org/r/809640 (https://phabricator.wikimedia.org/T307399) (owner: 10Btullis) [16:56:50] 10SRE, 10DC-Ops, 10Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) >>! In T297913#8037888, @gerritbot wrote: > Change 809641 had a related patch set uploaded (by RobH; author: RobH): > %%%[operations/puppet@production] testing h750 recipes%%%... 
[16:57:08] (03PS2) 10RobH: testing h750 recipes [puppet] - 10https://gerrit.wikimedia.org/r/809641 (https://phabricator.wikimedia.org/T302937) [16:57:16] (03PS3) 10RobH: testing h750 recipes [puppet] - 10https://gerrit.wikimedia.org/r/809641 (https://phabricator.wikimedia.org/T302937) [16:58:01] !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [16:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:15] (03CR) 10RobH: [C: 03+2] testing h750 recipes [puppet] - 10https://gerrit.wikimedia.org/r/809641 (https://phabricator.wikimedia.org/T302937) (owner: 10RobH) [16:58:25] !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [16:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:10] !log dduvall@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply [16:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:46] !log dduvall@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [16:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:00] !log dduvall@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply [17:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:30] !log dduvall@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply [17:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:04] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye [17:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:11] 10SRE, 10DC-Ops, 10Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet 
with OS bullseye [17:06:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P30626 and previous config saved to /var/cache/conftool/dbconfig/20220629-170622-ladsgroup.json [17:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:43] (03PS3) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) [17:10:10] (03CR) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [17:11:30] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:15:36] 10SRE, 10DC-Ops, 10Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10RobH) >>! In T302937#8032403, @fgiunchedi wrote: > As mentioned at the SRE meeting @BTullis is also looking into this for DSE hosts (review at https://gerrit.wikimedia.org/r/c/operations/puppet/+/8... [17:17:08] python3-anycast-healthchecker : Depends: python3-pythonjsonlogger but it is not going to be installed [17:17:24] except the package it seems like was called python3-json-logger, with the same upstream source [17:17:29] time for another package rebuild :] [17:17:39] sharing this because Icinga will complain about dpkg status shortly [17:17:44] on centrallog2001 [17:18:05] 10SRE, 10DC-Ops, 10Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10MoritzMuehlenhoff) >>! In T302937#8037948, @RobH wrote: >>>! In T302937#8032403, @fgiunchedi wrote: >> As mentioned at the SRE meeting @BTullis is also looking into this for DSE hosts (review at ht... 
[17:18:15] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dumpsdata1007.eqiad.wmnet with OS bullseye [17:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:21] 10SRE, 10DC-Ops, 10Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1007 (**FAIL**) - Removed f... [17:19:11] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye [17:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:18] 10SRE, 10DC-Ops, 10Patch-For-Review: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye [17:21:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298560)', diff saved to https://phabricator.wikimedia.org/P30627 and previous config saved to /var/cache/conftool/dbconfig/20220629-172127-ladsgroup.json [17:21:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [17:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:35] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [17:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [17:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:08] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: 
Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [17:31:00] PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: / 2054 MB (3% inode=97%): /tmp 2054 MB (3% inode=97%): /var/tmp 2054 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops [17:31:14] !log running puppet agent on centrallog2002 to finalize T310574 [17:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:19] T310574: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 [17:31:23] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage [17:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:46] yes my server, rise... rissssssssseeeeee [17:33:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:53] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage [17:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:59] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 (10ssingh) ` ===== NODE GROUP ===== (40) authdns[1001,2001].wikimedia.org,c... [17:40:20] (03PS3) 10Slyngshede: profile::prometheus::ops enable Ganeti metric scraping. [puppet] - 10https://gerrit.wikimedia.org/r/809594 [17:45:09] (03CR) 10Slyngshede: profile::prometheus::ops enable Ganeti metric scraping. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809594 (owner: 10Slyngshede) [17:45:42] (03PS4) 10MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099) [17:46:13] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Update CAS to 6.5 - https://phabricator.wikimedia.org/T311235 (10ssingh) p:05Triage→03Medium [17:48:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [17:48:41] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36124/console" [puppet] - 10https://gerrit.wikimedia.org/r/809594 (owner: 10Slyngshede) [17:51:01] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:54] (03CR) 10Jsn.sherman: [C: 03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: 10Eigyan) [17:54:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Cmjohnson) [17:54:29] (03CR) 10EllenR: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: 10Eigyan) [17:54:46] (03PS1) 10Dduvall: gitlab_runner: Allow internal docker DNS traffic [puppet] - 10https://gerrit.wikimedia.org/r/809650 (https://phabricator.wikimedia.org/T311241) [17:54:48] (03CR) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: 10Eigyan) [17:55:15] (03CR) 10Eigyan: [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: 10Eigyan) [17:55:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:19] (03PS1) 10Ebernhardson: metastore: Remove versioning from saneitize updates [extensions/CirrusSearch] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809564 [17:59:24] (03CR) 10EllenR: [C: 03+1] "mine are too!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: 10Eigyan) [17:59:43] (03CR) 10Dduvall: "dzahn was able to get things working again with a docker/firewall restart, so I'm not sure this is still necessary." [puppet] - 10https://gerrit.wikimedia.org/r/809650 (https://phabricator.wikimedia.org/T311241) (owner: 10Dduvall) [18:00:05] dduvall and hashar: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220629T1800). [18:00:06] dduvall and hashar: May I have your attention please! MediaWiki train - Utc-7+Utc-0 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220629T1800) [18:02:17] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:36] (03PS1) 10Dduvall: group1 wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809652 (https://phabricator.wikimedia.org/T308071) [18:02:38] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809652 (https://phabricator.wikimedia.org/T308071) (owner: 10Dduvall) [18:03:20] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809652 (https://phabricator.wikimedia.org/T308071) (owner: 10Dduvall) [18:04:22] db1128 is pooled and has a large amount of lag [18:05:07] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:06:01] Amir1: ^ [18:06:41] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:25] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.18 refs T308071 [18:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:29] T308071: 1.39.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T308071 [18:08:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:08:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:08:04] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:23] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:08:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:47] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1007.eqiad.wmnet with OS bullseye [18:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:53] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata1007 (**FAIL**) - Removed from Puppet and PuppetD... 
[18:11:00] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.18 refs T308071 (duration: 03m 35s) [18:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1051.mgmt.eqiad.wmnet with reboot policy FORCED [18:11:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1049.mgmt.eqiad.wmnet with reboot policy FORCED [18:11:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1048.mgmt.eqiad.wmnet with reboot policy FORCED [18:11:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1052.mgmt.eqiad.wmnet with reboot policy FORCED [18:11:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1053.mgmt.eqiad.wmnet with reboot policy FORCED [18:11:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt1050.mgmt.eqiad.wmnet with reboot policy FORCED [18:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:49] 10SRE, 10DC-Ops, 10Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) So post dumpsdata1007 install it fails puppet due to megaraid monitoring items it seems? 
[18:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (10Cmjohnson) [18:12:37] Amir1: db1128 is pooled and has a large amount of lag [18:12:39] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:12:50] JJMC89: on it [18:13:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:14:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1128', diff saved to https://phabricator.wikimedia.org/P30628 and previous config saved to /var/cache/conftool/dbconfig/20220629-181438-ladsgroup.json [18:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:56] JJMC89: depooled now [18:14:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:14:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:15:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:14] thanks [18:15:42] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata10... [18:18:31] 10SRE, 10DC-Ops, 10Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10MoritzMuehlenhoff) >>! In T297913#8038074, @RobH wrote: > So post dumpsdata1007 install it fails puppet due to megaraid monitoring items it seems? That's expected, we still need to... [18:18:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:29] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:19:49] 10SRE, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to wmf for demon - https://phabricator.wikimedia.org/T311661 (10thcipriani) [18:20:07] (03PS3) 10Ssingh: trafficserver: 9.x upgrade: switch ip_allow.config to YAML format [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) [18:20:56] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36125/console" [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [18:21:41] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:11] !log robh@cumin1001 START - Cookbook sre.hosts.provision for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED [18:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:39] (03PS3) 10Ssingh: trafficserver: 9.x upgrade: replace client.verify.server [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) [18:25:55] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [18:26:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:06] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648 [18:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:11] T309648: 
Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [18:27:25] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36126/console" [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [18:27:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2153.mgmt.codfw.wmnet with reboot policy FORCED [18:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [18:28:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1050.mgmt.eqiad.wmnet with reboot policy FORCED [18:28:08] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1053.mgmt.eqiad.wmnet with reboot policy FORCED [18:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:11] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1052.mgmt.eqiad.wmnet with reboot policy FORCED [18:28:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1048.mgmt.eqiad.wmnet with reboot policy FORCED [18:28:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1051.mgmt.eqiad.wmnet with reboot policy FORCED [18:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:14] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1049.mgmt.eqiad.wmnet with reboot policy FORCED [18:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2154.mgmt.codfw.wmnet with reboot policy FORCED [18:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:47] (03CR) 10Ssingh: [V: 03+1] "Rebased on top of I95c0009bc06 to allow for backward compatibility." [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [18:29:53] (03CR) 10Ssingh: [V: 03+1] "Rebased on top of I95c0009bc06 to allow for backward compatibility." [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [18:31:22] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) Something is very wrong with dumps1006, when I go to set it up, it doesn't see a 10G NIC, only the 1G. Rather than pollute this setup task, I'll create a high prio... 
[18:34:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:35] (03PS1) 10Cmjohnson: Adding new cloudvirt hosts to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/809656 (https://phabricator.wikimedia.org/T299574) [18:49:10] (03CR) 10CI reject: [V: 04-1] Adding new cloudvirt hosts to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/809656 (https://phabricator.wikimedia.org/T299574) (owner: 10Cmjohnson) [18:49:21] (03Abandoned) 10Cmjohnson: Adding new cloudvirt hosts to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/809656 (https://phabricator.wikimedia.org/T299574) (owner: 10Cmjohnson) [18:54:42] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Krenair) [18:56:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2154.mgmt.codfw.wmnet with reboot policy FORCED [18:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2153.mgmt.codfw.wmnet with reboot policy FORCED [18:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:50] (03PS1) 10Cmjohnson: Adding cloudvirt servers cloudvirt servers [puppet] - 10https://gerrit.wikimedia.org/r/809659 (https://phabricator.wikimedia.org/T305194) [18:59:25] (03CR) 10CI reject: [V: 04-1] Adding cloudvirt servers cloudvirt servers [puppet] - 10https://gerrit.wikimedia.org/r/809659 (https://phabricator.wikimedia.org/T305194) (owner: 10Cmjohnson) [19:03:47] (03Abandoned) 10Cmjohnson: Adding cloudvirt servers cloudvirt servers [puppet] - 10https://gerrit.wikimedia.org/r/809659 (https://phabricator.wikimedia.org/T305194) (owner: 10Cmjohnson) [19:05:42] (03PS1) 10Cmjohnson: 
Adding cloudvirts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/809660 (https://phabricator.wikimedia.org/T305194) [19:06:16] (03CR) 10CI reject: [V: 04-1] Adding cloudvirts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/809660 (https://phabricator.wikimedia.org/T305194) (owner: 10Cmjohnson) [19:08:34] (03PS2) 10Cmjohnson: Adding cloudvirts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/809660 (https://phabricator.wikimedia.org/T305194) [19:09:36] (03PS3) 10Cmjohnson: Adding cloudvirts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/809660 (https://phabricator.wikimedia.org/T305194) [19:10:22] (03CR) 10Cmjohnson: [C: 03+2] Adding cloudvirts to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/809660 (https://phabricator.wikimedia.org/T305194) (owner: 10Cmjohnson) [19:10:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2155.mgmt.codfw.wmnet with reboot policy FORCED [19:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2156.mgmt.codfw.wmnet with reboot policy FORCED [19:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:41] (03PS1) 10Cmjohnson: Adding cloudvirts to netboot [puppet] - 10https://gerrit.wikimedia.org/r/809661 (https://phabricator.wikimedia.org/T305194) [19:12:31] (03CR) 10Cmjohnson: [C: 03+2] Adding cloudvirts to netboot [puppet] - 10https://gerrit.wikimedia.org/r/809661 (https://phabricator.wikimedia.org/T305194) (owner: 10Cmjohnson) [19:15:41] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host db2156.mgmt.codfw.wmnet with reboot policy FORCED [19:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2156.mgmt.codfw.wmnet with reboot policy FORCED [19:16:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:48] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2022-06-28-153911-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/809665 [19:22:07] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2156.mgmt.codfw.wmnet with reboot policy FORCED [19:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2156.mgmt.codfw.wmnet with reboot policy FORCED [19:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:53] (03PS1) 10Zabe: Stop setting wgCentralAuthAutoNew [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809666 (https://phabricator.wikimedia.org/T257079) [19:24:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (10Cmjohnson) [19:24:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10Cmjohnson) [19:25:21] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2022-06-28-153911-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/809665 (owner: 10BryanDavis) [19:26:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) a:05RobH→03Jclark-ctr Ok, I updated the bios and then foolishly updated idrac, and now https implementation is broken for idrac. {F35287417} @Jclark-ctr (or @... 
[19:26:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2155.mgmt.codfw.wmnet with reboot policy FORCED [19:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:03] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2156.mgmt.codfw.wmnet with reboot policy FORCED [19:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2156.mgmt.codfw.wmnet with reboot policy FORCED [19:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:22] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2022-06-28-153911-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/809665 (owner: 10BryanDavis) [19:30:42] 10SRE, 10DC-Ops, 10Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) >>! In T297913#8038091, @MoritzMuehlenhoff wrote: >>>! In T297913#8038074, @RobH wrote: >> So post dumpsdata1007 install it fails puppet due to megaraid monitoring items it se... 
[19:31:04] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [19:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:28] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [19:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [19:32:04] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [19:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:41] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [19:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:54] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [19:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:57] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [19:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:15] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:01] (03PS1) 10Zabe: Stop setting wgBabelCentralApi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809671 (https://phabricator.wikimedia.org/T257079) [19:36:04] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2156.mgmt.codfw.wmnet with reboot policy FORCED [19:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:23] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1048.eqiad.wmnet with OS bullseye [19:36:26] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1048.eqiad.wmnet w... [19:36:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2158.mgmt.codfw.wmnet with reboot policy FORCED [19:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2157.mgmt.codfw.wmnet with reboot policy FORCED [19:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:21] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1049.eqiad.wmnet with OS bullseye [19:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1049.eqiad.wmnet w... 
[19:46:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1050.eqiad.wmnet with OS bullseye [19:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1050.eqiad.wmnet w... [19:46:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1051.eqiad.wmnet with OS bullseye [19:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1051.eqiad.wmnet w... [19:47:01] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1052.eqiad.wmnet with OS bullseye [19:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1052.eqiad.wmnet w... 
[19:47:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1053.eqiad.wmnet with OS bullseye [19:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1053.eqiad.wmnet w... [19:49:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage [19:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage [19:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Cmjohnson) @ayounsi @taavi @Andrew Has a determination on public vs private VLAN been decided? 
[19:57:13] (03PS1) 10Bartosz Dziewoński: New topic hint: Avoid error about section editing when opened from diff [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809688 (https://phabricator.wikimedia.org/T311665) [19:57:25] (03CR) 10Dzahn: [C: 03+2] gitlab_runner: Allow internal docker DNS traffic [puppet] - 10https://gerrit.wikimedia.org/r/809650 (https://phabricator.wikimedia.org/T311241) (owner: 10Dduvall) [19:57:33] (03PS1) 10Bartosz Dziewoński: New topic hint: Avoid error about section editing when opened from diff [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809689 (https://phabricator.wikimedia.org/T311665) [19:57:43] (03PS1) 10Bartosz Dziewoński: New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809690 (https://phabricator.wikimedia.org/T311597) [19:57:46] (03CR) 10Samtar: "Code is sound, but recommend holding off merging this per T310974#8034803" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: 10Stang) [19:57:53] (03PS2) 10Samtar: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: 10Stang) [19:57:59] (03PS1) 10Bartosz Dziewoński: New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809691 (https://phabricator.wikimedia.org/T311597) [19:58:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10Cmjohnson) @ayounsi @Andrew Has a determination on public vs private VLAN been decided? Also, @andrew which partman recipe... 
[19:59:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage [19:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:53] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1051.eqiad.wmnet with reason: host reimage [19:59:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage [19:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] RoanKattouw, Urbanecm, and cjming: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220629T2000). [20:00:05] cjming, mewoph, eigyan, ebernhardson, zabe, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage [20:00:22] Greetings Everyone! 
[20:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:35] hi all - i can deploy since i'm on the list [20:00:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) @cmooney are you requesting cloudnets to be moved to a different switch or this an open discussion with @Andrew [20:01:08] gey [20:01:12] i'll do them in order so I'll start with mine [20:01:12] s/gey/hey [20:01:15] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1052.eqiad.wmnet with reason: host reimage [20:01:18] cjming: perfect thanks! [20:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:32] (03CR) 10Clare Ming: [C: 03+2] Add jawiki, zhwikinews to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809620 (https://phabricator.wikimedia.org/T311419) (owner: 10Clare Ming) [20:02:04] hi [20:02:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2157.mgmt.codfw.wmnet with reboot policy FORCED [20:02:22] (03Merged) 10jenkins-bot: Add jawiki, zhwikinews to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809620 (https://phabricator.wikimedia.org/T311419) (owner: 10Clare Ming) [20:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2158.mgmt.codfw.wmnet with reboot policy FORCED [20:02:27] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1052.eqiad.wmnet with reason: host reimage [20:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:30] i'm aware that i overloaded the window slightly and that i am late, so if it can't be done, it's okay if you drop my patches [20:02:32] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:37] (although i hope we can include them) [20:02:54] \o [20:03:22] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage [20:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:58] (03PS1) 10Ottomata: analytics test cluster presto - configure iceberg with kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525) [20:04:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2159.mgmt.codfw.wmnet with reboot policy FORCED [20:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:54] thanks MatmaRex: let's see where we end up during the window -- hopefully we can do all of your patches [20:05:07] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36127/console" [puppet] - 10https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [20:05:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage [20:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:57] (03CR) 10BPirkle: [C: 03+1] "LGTM, I made an additional comment on naming, but am happy to merge without changes if, after a bit more thought, you're happy with it as-" [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [20:06:09] (03CR) 10Clare Ming: [C: 03+2] Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809550 (https://phabricator.wikimedia.org/T311467) (owner: 10Kosta Harlan) [20:06:11] 10SRE, 
10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [20:06:31] mewoph: starting on your patches [20:07:04] cjming: 👍 [20:07:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1051.eqiad.wmnet with reason: host reimage [20:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:34] !log cjming@deploy1002 Synchronized wmf-config/config: Config: [[gerrit:809620|Add jawiki, zhwikinews to pilot wikis (T311419)]] (duration: 03m 31s) [20:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:39] T311419: Change default skin on jawiki, zhwikinews to Vector (2022) - https://phabricator.wikimedia.org/T311419 [20:08:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2160.mgmt.codfw.wmnet with reboot policy FORCED [20:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:30] (03PS1) 10Papaul: Add new db nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/809677 (https://phabricator.wikimedia.org/T306927) [20:09:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:09:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage [20:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:05] (03CR) 10CI reject: [V: 04-1] Add new db nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/809677 (https://phabricator.wikimedia.org/T306927) (owner: 10Papaul) [20:10:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:10:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [20:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:02] (03PS2) 10Papaul: Add new db nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/809677 (https://phabricator.wikimedia.org/T306927) [20:11:02] cjming: is it ok to +2 the wmf17 patch at the same time or do we have to wait for the other one to be merged first? our tests take about ~20min and i don't want to take up the entire window since there are a lot of patches scheduled [20:11:05] !log cjming@deploy1002 Synchronized dblists/desktop-improvements.dblist: Config: [[gerrit:809620|Add jawiki, zhwikinews to pilot wikis (T311419)]] (duration: 03m 23s) [20:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:30] mewoph: i was just thinking to sync them together -- let's do it [20:11:43] cjming: cool thanks! [20:11:44] (03CR) 10Clare Ming: [C: 03+2] Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809549 (https://phabricator.wikimedia.org/T311467) (owner: 10Kosta Harlan) [20:11:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:42] (03PS2) 10Ottomata: analytics test cluster presto - configure iceberg with kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525) [20:13:41] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36128/console" [puppet] - 10https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [20:14:04] 10SRE, 10serviceops-collab, 10Patch-For-Review: Onboarding for Arnold Okoth - 
https://phabricator.wikimedia.org/T288645 (10Dzahn) [20:14:32] 10SRE, 10serviceops-collab, 10Patch-For-Review: Onboarding for Arnold Okoth - https://phabricator.wikimedia.org/T288645 (10Dzahn) @Arnoldokoth Any reason to keep it open? [20:15:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1048.eqiad.wmnet with OS bullseye [20:15:42] (03PS3) 10Ottomata: analytics test cluster presto - configure iceberg with kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525) [20:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1048.eqiad.wmnet with... [20:15:51] (03PS4) 10Ottomata: analytics test cluster presto - configure iceberg with kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525) [20:16:39] !log LDAP - mwmaint1002 - added demon to wmf group (T311661) [20:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:44] T311661: Grant Access to wmf for demon - https://phabricator.wikimedia.org/T311661 [20:17:08] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36129/console" [puppet] - 10https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [20:17:23] 10SRE, 10LDAP-Access-Requests, 10Release-Engineering-Team (Radar): Grant Access to wmf for demon - https://phabricator.wikimedia.org/T311661 (10Dzahn) 05Open→03Resolved a:03Dzahn done. Chad is already in shell access group so there is no puppet code change needed for this. 
added to wmf [20:19:18] (03CR) 10Ottomata: [V: 03+1 C: 03+2] analytics test cluster presto - configure iceberg with kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/809676 (https://phabricator.wikimedia.org/T311525) (owner: 10Ottomata) [20:20:19] 20 mins for CI -- oof [20:21:10] !log restarting docker on all 6 gitlab-runners via cumin T311241 [20:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:17] T311241: DNS/networking not working on Trusted Runners - https://phabricator.wikimedia.org/T311241 [20:21:47] cjming: you could +2 several patches at once, so that the CI can run in parallel [20:22:14] oh, you already talked about it above [20:23:49] MatmaRex: ya - thanks -- when we get to yours, presumably we can +2 all of them at the same time [20:24:05] yeah [20:24:06] (03Merged) 10jenkins-bot: Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809550 (https://phabricator.wikimedia.org/T311467) (owner: 10Kosta Harlan) [20:24:59] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1049.eqiad.wmnet with OS bullseye [20:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1049.eqiad.wmnet with... [20:25:28] mewoph: your .18 patch is on mwdebug1002 - can you test? 
[20:25:37] cjming: looking now [20:25:40] well, you could even +2 them now, AFAIK nothing bad will happen if they finish and get merged early while you're deploying something else, the actual deployment is still manual, right? (as long as we don't forget about them after the deployment window ends) [20:26:57] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS buster [20:26:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1050.eqiad.wmnet with OS bullseye [20:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster [20:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1050.eqiad.wmnet with... [20:27:09] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:28:08] MatmaRex: the only thing that concerns me is if something needs reverting - so i tend to do them linearly to keep what i'm doing straight which isn't usually an issue with a few patches in the window -- with this many, i'm not sure [20:28:26] cjming: lgtm [20:28:31] cool -syncing [20:28:49] yeah, makes sense [20:29:40] mewoph: your .17 patch -- in zuul is that a non-voting failure? 
[20:30:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10Cmjohnson) [20:31:01] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1051.eqiad.wmnet with OS bullseye [20:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1051.eqiad.wmnet with... [20:32:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:21] cjming: i think it's voting but the test that failed is not related to the change [20:32:29] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/GrowthExperiments/modules/ext.growthExperiments.StructuredTask/TargetInitializer.js: Backport: [[gerrit:809550|Structured task: Add 'cancel' to the list of allowed commands (T311467)]] (duration: 03m 37s) [20:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:34] T311467: [wmf.17-mobile] "Suggestions" label has no padding - https://phabricator.wikimedia.org/T311467 [20:32:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:32:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:11] mewoph: your .18 patch should be live [20:33:40] eigyan: i think you're here - i'm going to go ahead and 
start your patch [20:33:55] great..thanks mewoph [20:33:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:55] (03CR) 10CI reject: [V: 04-1] Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809549 (https://phabricator.wikimedia.org/T311467) (owner: 10Kosta Harlan) [20:36:00] mewoph: i'll try again -- in the meantime i'll continue with the next patch [20:36:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2160.mgmt.codfw.wmnet with reboot policy FORCED [20:36:11] cjming: we can skip the wmf17 for this window, don't want to take up any more time and wmf18 is going out tmr anyway :( [20:36:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2159.mgmt.codfw.wmnet with reboot policy FORCED [20:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:26] PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:39] mewoph: ok -- lmk if you change your mind - i can check with you later [20:36:50] (03CR) 10Clare Ming: [C: 03+2] [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: 10Eigyan) [20:37:32] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1006.eqiad.wmnet with OS buster [20:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster exec... [20:38:09] (03Merged) 10jenkins-bot: [wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809634 (https://phabricator.wikimedia.org/T311643) (owner: 10Eigyan) [20:38:50] eigyan: your patch should be on mwdebug1002 - can you check? [20:39:08] checking now cjming thank you! [20:41:02] (03CR) 10Papaul: [C: 03+2] Add new db nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/809677 (https://phabricator.wikimedia.org/T306927) (owner: 10Papaul) [20:41:08] cjming everything looks 💯 [20:41:16] cool - syncing now [20:41:28] ebernhardson: i think you're here - is yours a no-op? [20:41:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1007.eqiad.wmnet with OS buster [20:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS buster [20:41:56] cjming: hmm, it shouldn't be. 
sec [20:42:07] (03CR) 10Cwhite: [C: 03+2] "PCC checks out: https://puppet-compiler.wmflabs.org/pcc-worker1003/36130/" [puppet] - 10https://gerrit.wikimedia.org/r/809302 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [20:42:09] (03CR) 10Clare Ming: [C: 03+2] metastore: Remove versioning from saneitize updates [extensions/CirrusSearch] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809564 (owner: 10Ebernhardson) [20:42:10] cjming: maybe someone already deployed the change and i didn't notice [20:42:31] ebernhardson: merging yours now [20:42:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2153.codfw.wmnet with OS bullseye [20:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:48] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2153.codfw.wmnet with OS... [20:42:54] cjming: thanks. Yet again this is something i can't really test on mwdebug, it runs from the job queue every 2 hours [20:43:15] ebernhardson: got it - then i'll go ahead and sync [20:43:24] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1031.eqiad.wmnet with OS buster [20:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS... 
[20:43:43] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [20:44:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:45:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:08] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:809634|[wmf-config]: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. (T311643)]] (duration: 03m 25s) [20:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:15] T311643: Deploy GDI Survey Wave 2 on ES,FR,PT wikis. - https://phabricator.wikimedia.org/T311643 [20:45:18] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1025.eqiad.wmnet with OS buster [20:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS... [20:45:27] (03CR) 10Cwhite: "Incorporated feedback from the meeting today." [puppet] - 10https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [20:45:28] eigyan: your change should be live [20:45:49] Excellent! thank you cjming! 
[20:45:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:45:55] np! [20:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:30] zabe: i think you're here too? i will do yours next [20:46:53] ok [20:47:08] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1052.eqiad.wmnet with OS bullseye [20:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1052.eqiad.wmnet with... [20:48:31] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1007.eqiad.wmnet with OS buster [20:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS buster exec... [20:49:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1006.eqiad.wmnet with OS bullseye [20:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye [20:51:07] MatmaRex: can you rebase your .17 patches? 
and when Erik's patch is merged, can you also rebase your .18 patches? [20:51:28] sure [20:51:46] if they need it - maybe they don't [20:51:53] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host mw1496.mgmt.eqiad.wmnet with reboot policy FORCED [20:51:55] (03PS2) 10Bartosz Dziewoński: New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809690 (https://phabricator.wikimedia.org/T311597) [20:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:06] (03PS2) 10Bartosz Dziewoński: New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809691 (https://phabricator.wikimedia.org/T311597) [20:54:22] (03CR) 10Andrew Bogott: openstack: make enc-cli authenticate via keystone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779899 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [20:54:30] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dumpsdata1006.mgmt.eqiad.wmnet with reboot policy FORCED [20:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1031.eqiad.wmnet with reason: host reimage [20:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:56] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:56:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1053.eqiad.wmnet with OS bullseye [20:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) 
rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1053.eqiad.wmnet with... [20:56:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1025.eqiad.wmnet with reason: host reimage [20:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:14] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1031.eqiad.wmnet with reason: host reimage [20:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:43] RECOVERY - Check systemd state on elastic1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:59:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10Cmjohnson) 05Open→03Resolved [20:59:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10Cmjohnson) pinging @andrew so he knows the base image has been completed. Resolving the task. 
[20:59:18] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (10Cmjohnson) [20:59:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (10Cmjohnson) [20:59:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (10Cmjohnson) 05Open→03Resolved pinging @andrew so he knows the base image has been completed. Resolving the task. [21:00:48] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1025.eqiad.wmnet with reason: host reimage [21:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2153.codfw.wmnet with reason: host reimage [21:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:21] PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:29] (03Merged) 10jenkins-bot: metastore: Remove versioning from saneitize updates [extensions/CirrusSearch] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809564 (owner: 10Ebernhardson) [21:02:47] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw1496.mgmt.eqiad.wmnet with reboot policy FORCED [21:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:53] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:03:31] (03CR) 10Clare Ming: [C: 03+2] Stop setting 
wgCentralAuthAutoNew [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809666 (https://phabricator.wikimedia.org/T257079) (owner: 10Zabe) [21:04:01] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:04:53] (03Merged) 10jenkins-bot: Stop setting wgCentralAuthAutoNew [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809666 (https://phabricator.wikimedia.org/T257079) (owner: 10Zabe) [21:05:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2153.codfw.wmnet with reason: host reimage [21:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:22] (03PS2) 10Clare Ming: Stop setting wgBabelCentralApi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809671 (https://phabricator.wikimedia.org/T257079) (owner: 10Zabe) [21:05:36] (03PS1) 10Cwhite: grafana: ldap parameter expects a hash by default [puppet] - 10https://gerrit.wikimedia.org/r/809682 [21:06:28] zabe: your 1st patch is on mwdebug1002 - not sure if it's testable [21:06:52] cjming, it's not really testable [21:06:57] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/CirrusSearch/includes/MetaStore/MetaSaneitizeJobStore.php: Backport: [[gerrit:809564|metastore: Remove versioning from saneitize updates]] (duration: 03m 35s) [21:07:00] then i will sync [21:07:01] I will keep an eye on logstash afterwards [21:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:31] ebernhardson: your change should be live [21:07:55] (03CR) 10Clare Ming: [C: 03+2] Stop setting wgBabelCentralApi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809671 (https://phabricator.wikimedia.org/T257079) (owner: 10Zabe) [21:08:02] cjming: thanks! [21:08:08] np! 
[21:08:44] (03Merged) 10jenkins-bot: Stop setting wgBabelCentralApi [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809671 (https://phabricator.wikimedia.org/T257079) (owner: 10Zabe) [21:09:31] (03CR) 10Cwhite: [C: 03+2] grafana: ldap parameter expects a hash by default [puppet] - 10https://gerrit.wikimedia.org/r/809682 (owner: 10Cwhite) [21:09:58] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [21:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:42] ok MatmaRex: let's do yours if you're still around and up for it [21:10:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host mw1496.mgmt.eqiad.wmnet with reboot policy FORCED [21:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:16] !log cjming@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:809666|Stop setting wgCentralAuthAutoNew (T257079)]] (duration: 03m 28s) [21:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:20] T257079: Audit all mismatched/unused wmf-config settings - https://phabricator.wikimedia.org/T257079 [21:12:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:12:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:13:08] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:12] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:17] MatmaRex: do you want to do your patches still? [21:13:39] cjming: sure, if you're okay with it [21:14:01] i am [21:14:12] (03CR) 10Clare Ming: [C: 03+2] New topic hint: Avoid error about section editing when opened from diff [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809688 (https://phabricator.wikimedia.org/T311665) (owner: 10Bartosz Dziewoński) [21:14:19] (03CR) 10Clare Ming: [C: 03+2] New topic hint: Avoid error about section editing when opened from diff [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809689 (https://phabricator.wikimedia.org/T311665) (owner: 10Bartosz Dziewoński) [21:15:08] !log cjming@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:809671|Stop setting wgBabelCentralApi (T257079)]] (duration: 03m 30s) [21:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:21] zabe: both your patches should be live [21:15:29] thanks :) [21:17:04] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [21:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:35] MatmaRex: I +2'd your 1st 2 patches -- do the 2nd 2 need to be rebased after merge or can i go ahead and merge them now too? 
[21:18:01] cjming: no, you can +2 them, they should get merged [21:18:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:18:10] (03CR) 10Clare Ming: [C: 03+2] New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809690 (https://phabricator.wikimedia.org/T311597) (owner: 10Bartosz Dziewoński) [21:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:13] (03CR) 10Clare Ming: [C: 03+2] New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809691 (https://phabricator.wikimedia.org/T311597) (owner: 10Bartosz Dziewoński) [21:18:30] thanks [21:19:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:19:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:23] (03Merged) 10jenkins-bot: New topic hint: Avoid error about section editing when opened from diff [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809688 (https://phabricator.wikimedia.org/T311665) (owner: 10Bartosz Dziewoński) [21:20:29] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw1496.mgmt.eqiad.wmnet with reboot policy FORCED [21:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:55] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:08] 
(03Merged) 10jenkins-bot: New topic hint: Avoid error about section editing when opened from diff [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809689 (https://phabricator.wikimedia.org/T311665) (owner: 10Bartosz Dziewoński) [21:21:14] RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:21:27] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1496.eqiad.wmnet with OS buster [21:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1496.eqiad.wmnet with OS buster [21:22:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1475.eqiad.wmnet with OS buster [21:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1475.eqiad.wmnet with OS buster [21:23:50] MatmaRex: your 1st 2 patches are on mwdebug1002 if they're testable [21:23:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host mw1477.mgmt.eqiad.wmnet with reboot policy FORCED [21:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2153.codfw.wmnet with OS bullseye [21:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:26] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: 
Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2153.codfw.wmnet with OS bul... [21:24:39] cjming: yep, looking [21:25:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:24] (03Merged) 10jenkins-bot: New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809690 (https://phabricator.wikimedia.org/T311597) (owner: 10Bartosz Dziewoński) [21:25:27] (03Merged) 10jenkins-bot: New topic hint: Add clear:both [extensions/DiscussionTools] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809691 (https://phabricator.wikimedia.org/T311597) (owner: 10Bartosz Dziewoński) [21:25:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:26:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:19] cjming: looks good [21:26:26] syncing [21:26:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:02] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:36] MatmaRex: and your last 2 patches are up on mwdebug1002 [21:29:04] cjming: also looks good! 
[21:29:20] cool - will sync those as well then -- i'll ping you here when all of them are live [21:30:09] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.17/extensions/DiscussionTools/modules/NewTopicController.js: Backport: [[gerrit:809688|New topic hint: Avoid error about section editing when opened from diff (T311665)]] (duration: 03m 35s) [21:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2154.codfw.wmnet with OS bullseye [21:30:15] T311665: Clicking New section and then legacy mode from diff view gives a confusing error message "Section editing not supported" - https://phabricator.wikimedia.org/T311665 [21:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:22] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2154.codfw.wmnet with OS... 
[21:30:47] (03PS1) 10Cwhite: beta-logs: add minimal grafana config [puppet] - 10https://gerrit.wikimedia.org/r/809706 (https://phabricator.wikimedia.org/T222826) [21:31:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:01] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:25] PROBLEM - Host db1173.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:32:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:32:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1496.eqiad.wmnet with reason: host reimage [21:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:08] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1475.eqiad.wmnet with reason: host reimage [21:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:07] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/DiscussionTools/modules/NewTopicController.js: Backport: [[gerrit:809689|New topic hint: Avoid error about section editing when opened from diff (T311665)]] (duration: 03m 43s) [21:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:47] 
!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1025.eqiad.wmnet with OS buster [21:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS bus... [21:35:55] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1006.eqiad.wmnet with OS bullseye [21:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye ex... 
[21:36:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1496.eqiad.wmnet with reason: host reimage [21:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:42] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw1477.mgmt.eqiad.wmnet with reboot policy FORCED [21:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:29] 10SRE, 10ops-eqiad, 10DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (10Jclark-ctr) Replaced Dimm A7 powered host on [21:37:38] RECOVERY - Host db1173.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [21:37:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host mw1477.eqiad.wmnet with OS buster [21:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:51] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.17/extensions/DiscussionTools/modules/dt.ui.NewTopicController.less: Backport: [[gerrit:809690|New topic hint: Add clear:both (T311597)]] (duration: 03m 24s) [21:37:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host mw1477.eqiad.wmnet with OS buster [21:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:56] T311597: Legacy section=new hint may overlap with floating boxes - https://phabricator.wikimedia.org/T311597 [21:37:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1475.eqiad.wmnet with reason: host reimage [21:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:38:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:39:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:37] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.18/extensions/DiscussionTools/modules/dt.ui.NewTopicController.less: Backport: [[gerrit:809691|New topic hint: Add clear:both (T311597)]] (duration: 03m 27s) [21:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1031.eqiad.wmnet with OS buster [21:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1031.eqiad.wmnet with OS bus... [21:42:11] MatmaRex: all your changes should be live [21:42:17] thanks! [21:42:30] np! 
[21:42:54] !log end of UTC late backport window [21:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Cmjohnson) [21:43:56] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (10Cmjohnson) [21:44:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Cmjohnson) 05Open→03Resolved pinging @andrew to notify the task has been resolved [21:48:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1477.eqiad.wmnet with reason: host reimage [21:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:08] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:49:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2154.codfw.wmnet with reason: host reimage [21:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:54] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:52:11] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1477.eqiad.wmnet with reason: host reimage [21:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:23] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:54:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2154.codfw.wmnet with reason: host reimage [21:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1496.eqiad.wmnet with OS buster [21:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1496.eqiad.wmnet with OS buster completed: - mw1496 (**PASS**) -... [21:58:16] (03CR) 10Cwhite: [C: 03+2] beta-logs: add minimal grafana config [puppet] - 10https://gerrit.wikimedia.org/r/809706 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [22:02:56] (03PS1) 10Cwhite: loki: add ferm rule to control api access [puppet] - 10https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) [22:07:08] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:11] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1475.eqiad.wmnet with OS buster [22:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1475.eqiad.wmnet with OS buster completed: - mw1475 (**PASS**) -... 
[22:09:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Cmjohnson) @BTullis Robh figured out the workaround to get the right raid volume to boot first. I tried on an-presto1006 and everything seeme... [22:11:50] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:12:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10Cmjohnson) [22:14:15] 10SRE, 10ops-eqiad, 10DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (10Cmjohnson) 05Open→03Resolved @Jclark-ctr completed the task, we will send the broken parts back to Dell [22:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:16:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudnet1006.mgmt.eqiad.wmnet with reboot policy FORCED [22:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:06] (03PS1) 10Ahmon Dancy: Allow mwbuilder group to access mwdeploy key [puppet] - 10https://gerrit.wikimedia.org/r/809712 (https://phabricator.wikimedia.org/T310395) [22:19:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (10Cmjohnson) [22:19:48] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is 
CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:19:58] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:22:02] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:25:08] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [22:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2154.codfw.wmnet with OS bullseye [22:26:02] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:07] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2154.codfw.wmnet with OS bullseye completed: - db2... 
[22:27:18] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (33) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, cloudvirt1051, cloudvirt1052, db2154, gitlab1001, gitlab1003, gitlab1004, gitlab2001, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe20 [22:27:18] e2012, mw1475, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [22:29:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:53] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:14] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1477.eqiad.wmnet with OS buster [22:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host mw1477.eqiad.wmnet with OS buster completed: - mw1477 (**PASS**) -... [22:33:16] (03CR) 10Cwhite: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [22:33:53] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:34:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudnet1006.mgmt.eqiad.wmnet with reboot policy FORCED [22:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Cmjohnson) [22:36:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10Cmjohnson) 05Open→03Resolved @Dzahn I am not sure if this is you but these are installed. Resolving the task [22:37:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host stat1009.mgmt.eqiad.wmnet with reboot policy FORCED [22:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:15] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:41:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2155.codfw.wmnet with OS bullseye [22:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:38] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye [22:58:12] (03PS1) 10BryanDavis: striker: connect docker container directly to host network 
[puppet] - 10https://gerrit.wikimedia.org/r/809714 (https://phabricator.wikimedia.org/T306469) [22:59:36] (03PS1) 10Cmjohnson: Adding new cloudnet, cloudrabbit and cloudservice nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/809715 (https://phabricator.wikimedia.org/T304888) [23:00:19] (03CR) 10Cmjohnson: [C: 03+2] Adding new cloudnet, cloudrabbit and cloudservice nodes to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/809715 (https://phabricator.wikimedia.org/T304888) (owner: 10Cmjohnson) [23:01:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2155.codfw.wmnet with reason: host reimage [23:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:39] (03CR) 10BryanDavis: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/36132/" [puppet] - 10https://gerrit.wikimedia.org/r/809714 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [23:04:17] where do logs go on a production host? I'm trying to look at logs emitted to the 'Parsoid' channel on a parsoid host (i think probably info or others that aren't in logstash). [23:05:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2155.codfw.wmnet with reason: host reimage [23:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:14] i logged onto wtp1046 and looked at /var/log/mediawiki/ but nothing useful there. [23:07:11] will look on wikitech for docs [23:08:22] subbu: mwlog1002 /srv/mw-log [23:08:40] yes, found it. :) thanks. 
[23:19:28] (03CR) 10BryanDavis: striker: connect docker container directly to host network (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809714 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [23:23:09] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:28:59] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:14] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648 [23:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:20] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648 [23:34:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host stat1009.mgmt.eqiad.wmnet with reboot policy FORCED [23:34:05] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host stat1009.mgmt.eqiad.wmnet with reboot policy FORCED [23:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:23] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:40:30] Did something change with commons y'day? So, I start looking at a spike in volume of Parsoid events (not fatal, or errors) by 2x as of y'day ... 
found that for commonswiki slow-parsoid parses (> 3s) spiked by 15x since y'day (but note that new code didn't go out to Parsoid hosts till today) ... see https://logstash.wikimedia.org/goto/bff9cfa00d556f53c143780858d84adf [23:41:36] But, turns out this is also the same (5x spike) with the core/legacy parser. https://logstash.wikimedia.org/goto/09e7b301cbc0451e32e9b90246ad27ba [23:45:59] TimStarling, Krinkle ^ fyi. [23:47:51] that is concerning [23:49:15] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:50:22] Since commons didn't get new code till today, it couldn't be a code change. [23:50:46] could be a commons template or module change [23:50:52] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host db2155.codfw.wmnet with OS bullseye [23:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:57] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye completed: - db2... [23:51:00] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye executed with er... [23:51:16] aah ... okay ... so, it is just triggering a large volume of parses likely. 
[23:51:17] previewing one affected page, I can reproduce 4.4s total as in the logs, of which 3.2s is Lua [23:52:18] https://commons.wikimedia.org/wiki/Category:Ahmedabad -- note right floating infobox [23:52:23] (03PS1) 10Andrew Bogott: wmcs-enc-cli.py: fix args passed to requests.post [puppet] - 10https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) [23:53:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2154.codfw.wmnet with OS bullseye [23:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:48] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2154.codfw.wmnet with OS bullseye [23:53:55] (03CR) 10CI reject: [V: 04-1] wmcs-enc-cli.py: fix args passed to requests.post [puppet] - 10https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) (owner: 10Andrew Bogott) [23:55:31] we really need a template profiler [23:55:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2157.codfw.wmnet with OS bullseye [23:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:43] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2157.codfw.wmnet with OS bullseye [23:55:52] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2154.codfw.wmnet with OS bullseye [23:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:57] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - 
https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2154.codfw.wmnet with OS bullseye executed with er... [23:56:01] RECOVERY - Check systemd state on thanos-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:56:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2155.codfw.wmnet with OS bullseye [23:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:00] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye [23:58:52] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul)