[00:16:15] SRE, MW-on-K8s, MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth, and 4 others: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} - https://phabricator.wikimedia.org/T342201 (Tgr) >>! In T34220...
[00:18:54] RECOVERY - Druid overlord on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[00:28:27] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1025.eqiad.wmnet
[00:28:36] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1025.eqiad.wmnet
[00:28:52] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1033.eqiad.wmnet
[00:28:58] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1033.eqiad.wmnet
[00:29:31] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1033.eqiad.wmnet
[00:38:05] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1033.eqiad.wmnet
[00:38:28] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962229
[00:38:30] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962229 (owner: TrainBranchBot)
[00:46:32] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:46] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962229 (owner: TrainBranchBot)
[00:57:15] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[01:03:50] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T347919 (phaultfinder)
[01:06:27] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1025.eqiad.wmnet
[01:06:32] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1025.eqiad.wmnet
[01:15:13] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[01:15:34] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[01:18:15] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1025.eqiad.wmnet
[01:21:32] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1025.eqiad.wmnet
[01:22:58] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[01:23:09] ugh, sorry cwhite. unlocked now, finally
[01:29:20] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[01:30:32] (CR) Andrew Bogott: [C: +2] Add radosgw apis to eqiad1 [puppet] - https://gerrit.wikimedia.org/r/962707 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott)
[01:32:46] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[01:33:38] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1025.eqiad.wmnet
[01:33:39] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1025.eqiad.wmnet
[01:34:38] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1033.eqiad.wmnet
[01:35:16] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1033.eqiad.wmnet
[01:36:52] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[01:48:10] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1033.eqiad.wmnet
[01:48:11] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1033.eqiad.wmnet
[01:49:30] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:53:13] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[01:54:10] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T0200)
[02:07:32] (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.29 [core] (wmf/1.41.0-wmf.29) - https://gerrit.wikimedia.org/r/962230 (https://phabricator.wikimedia.org/T347080)
[02:07:38] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.29 [core] (wmf/1.41.0-wmf.29) - https://gerrit.wikimedia.org/r/962230 (https://phabricator.wikimedia.org/T347080) (owner: TrainBranchBot)
[02:07:48] (PS5) Krinkle: noc: Fix various PHP errors that prevent db.php from working locally [mediawiki-config] - https://gerrit.wikimedia.org/r/944355 (https://phabricator.wikimedia.org/T341859)
[02:07:51] (CR) Krinkle: [C: +2] noc: Fix various PHP errors that prevent db.php from working locally [mediawiki-config] - https://gerrit.wikimedia.org/r/944355 (https://phabricator.wikimedia.org/T341859) (owner: Krinkle)
[02:07:55] (PS3) Krinkle: noc: Add new format=json to wiki.php, and JSON button to db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/961869 (https://phabricator.wikimedia.org/T343398)
[02:07:57] (CR) Krinkle: [C: +2] noc: Add new format=json to wiki.php, and JSON button to db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/961869 (https://phabricator.wikimedia.org/T343398) (owner: Krinkle)
[02:09:10] (CR) CI reject: [V: -1] noc: Add new format=json to wiki.php, and JSON button to db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/961869 (https://phabricator.wikimedia.org/T343398) (owner: Krinkle)
[02:09:12] (Merged) jenkins-bot: noc: Fix various PHP errors that prevent db.php from working locally [mediawiki-config] - https://gerrit.wikimedia.org/r/944355 (https://phabricator.wikimedia.org/T341859) (owner: Krinkle)
[02:09:14] (Merged) jenkins-bot: noc: Add new format=json to wiki.php, and JSON button to db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/961869 (https://phabricator.wikimedia.org/T343398) (owner: Krinkle)
[02:10:24] (PS5) Krinkle: Remove legacy wgRC2UDPPrefix overrides for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/820245
[02:14:32] (PS1) Krinkle: noc: Fix undefined function str_starts_with on db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/962742
[02:14:41] (CR) Krinkle: [C: +2] noc: Fix undefined function str_starts_with on db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/962742 (owner: Krinkle)
[02:15:29] (Merged) jenkins-bot: noc: Fix undefined function str_starts_with on db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/962742 (owner: Krinkle)
[02:17:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:17:51] !log krinkle@deploy2002 Synchronized docroot/noc/: (no justification provided) (duration: 08m 03s)
[02:22:02] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[02:22:34] (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.29 [core] (wmf/1.41.0-wmf.29) - https://gerrit.wikimedia.org/r/962230 (https://phabricator.wikimedia.org/T347080) (owner: TrainBranchBot)
[02:22:35] (KubernetesAPILatency) resolved: (18) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:25:08] RECOVERY - Check systemd state on dumpsdata1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:07] (PS1) Andrew Bogott: radosgw: include a few missing pieces for eqiad1 [puppet] - https://gerrit.wikimedia.org/r/962743 (https://phabricator.wikimedia.org/T276961)
[02:30:25] (CR) Krinkle: [C: +2] Profiler: Enable logging of caught Redis exceptions to Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/962725 (https://phabricator.wikimedia.org/T347916) (owner: Krinkle)
[02:30:27] (CR) Krinkle: [C: +2] Remove legacy wgRC2UDPPrefix overrides for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/820245 (owner: Krinkle)
[02:31:10] (Merged) jenkins-bot: Profiler: Enable logging of caught Redis exceptions to Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/962725 (https://phabricator.wikimedia.org/T347916) (owner: Krinkle)
[02:31:13] (Merged) jenkins-bot: Remove legacy wgRC2UDPPrefix overrides for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/820245 (owner: Krinkle)
[02:33:15] (CR) Andrew Bogott: [C: +2] radosgw: include a few missing pieces for eqiad1 [puppet] - https://gerrit.wikimedia.org/r/962743 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott)
[02:33:38] !log krinkle@deploy2002 Started scap: (no justification provided)
[02:34:06] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[02:38:47] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:05] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:41:12] !log krinkle@deploy2002 Finished scap: (no justification provided) (duration: 07m 34s)
[02:41:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:44:51] (PS1) Andrew Bogott: Remove profile::cloudceph::client::rbd_glance [puppet] - https://gerrit.wikimedia.org/r/962744 (https://phabricator.wikimedia.org/T276961)
[02:46:34] (KubernetesAPILatency) resolved: (16) High Kubernetes API latency (GET blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:49:00] (CR) Andrew Bogott: [C: +2] Remove profile::cloudceph::client::rbd_glance [puppet] - https://gerrit.wikimedia.org/r/962744 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott)
[03:00:06] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T0300)
[03:01:20] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:03:53] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:05:42] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:11:05] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:21:38] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:22:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[03:28:12] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:40:14] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:41:12] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[03:46:10] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[03:46:34] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[03:46:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T343198)', diff saved to https://phabricator.wikimedia.org/P52802 and previous config saved to /var/cache/conftool/dbconfig/20231003-034640-arnaudb.json
[03:46:44] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[04:05:26] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[04:05:35] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[04:05:44] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[04:07:55] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[04:08:06] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[04:08:14] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[04:09:00] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[04:09:56] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[04:10:48] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[04:11:32] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[04:11:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:12:27] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[04:13:17] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[04:16:38] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:20:24] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[04:20:41] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[04:20:55] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[04:21:38] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:50:42] (PS1) Ilias Sarantopoulos: httpbb(liftwing): remove deprecated servers from tests [puppet] - https://gerrit.wikimedia.org/r/962752
[05:46:28] Is it OK to deploy MinT in a few minutes?
[05:49:30] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:52:08] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on druid1009.eqiad.wmnet with reason: Downtime as we setup the host to join the druid and zookeper cluster
[05:52:22] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on druid1009.eqiad.wmnet with reason: Downtime as we setup the host to join the druid and zookeper cluster
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T0600)
[06:00:04] kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T0600).
[06:17:42] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:18:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:28:58] (CR) Kevin Bazira: [C: +1] "Thank you for cleaning these up!" [puppet] - https://gerrit.wikimedia.org/r/962752 (owner: Ilias Sarantopoulos)
[06:31:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:53] * kart_ will update MinT; seems OK to go ahead..
[06:40:21] (CR) KartikMistry: [C: +2] Update MinT to 2023-09-28-043052-production [deployment-charts] - https://gerrit.wikimedia.org/r/961977 (https://phabricator.wikimedia.org/T343450) (owner: KartikMistry)
[06:41:18] (Merged) jenkins-bot: Update MinT to 2023-09-28-043052-production [deployment-charts] - https://gerrit.wikimedia.org/r/961977 (https://phabricator.wikimedia.org/T343450) (owner: KartikMistry)
[06:42:34] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[06:42:41] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[06:42:59] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:45:02] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:45:32] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:46:19] (PS1) Ayounsi: Add "Auto-Submitted" header to dbbackup scripts [puppet] - https://gerrit.wikimedia.org/r/962940 (https://phabricator.wikimedia.org/T347835)
[06:51:48] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[06:56:00] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[06:59:28] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[07:00:05] Amir1, Urbanecm, and taavi: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T0700).
[07:00:05] Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:31] Hi :)
[07:02:39] SRE, All-and-every-Wikisource, Product-Analytics, Bengali-Sites, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (SCherukuwada) The important takeaway from this (as per our discussion) was this bit: //Google doesn't guarantee that it will cr...
[07:03:18] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[07:03:45] !log Updated MinT to 2023-09-28-043052-production (T343450, T341478)
[07:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:59] T343450: Enable MinT for closely-related languages based on community input - https://phabricator.wikimedia.org/T343450
[07:03:59] T341478: Port the markup transfer feature of cxserver to MinT - https://phabricator.wikimedia.org/T341478
[07:04:10] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:11:09] Need a priority deployment for the throttle exemption patch (Editathon just started) - if someone can deploy please let me know :) https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/962729
[07:12:00] (PS6) Superpes15: [fiwiki] Add an editautoreviewprotected level protection [mediawiki-config] - https://gerrit.wikimedia.org/r/960201 (https://phabricator.wikimedia.org/T347069)
[07:13:58] Superpes: hey, looking
[07:14:31] Hi taavi thanks ;)
[07:15:56] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/962729 (https://phabricator.wikimedia.org/T347874) (owner: Superpes15)
[07:15:58] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/960201 (https://phabricator.wikimedia.org/T347069) (owner: Superpes15)
[07:16:42] (Merged) jenkins-bot: [enwiki] Throttle exemption for Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/962729 (https://phabricator.wikimedia.org/T347874) (owner: Superpes15)
[07:16:45] (Merged) jenkins-bot: [fiwiki] Add an editautoreviewprotected level protection [mediawiki-config] - https://gerrit.wikimedia.org/r/960201 (https://phabricator.wikimedia.org/T347069) (owner: Superpes15)
[07:17:46] uhhh
[07:21:09] scap backport is having an issue, so I'm doing this the old-fashioned way
[07:21:33] Uhhhhhhhhh :O
[07:22:11] Is something down? I see also "Error while fetching results for wm-zuul-status: TypeError: Failed to fetch" :/
[07:23:51] that's probably unrelated..
[07:24:53] Lol
[07:27:36] !log taavi@deploy2002 Started scap: T347874 and T347069
[07:27:41] T347069: Add new autoreview page protection level for Finnish Wikipedia - https://phabricator.wikimedia.org/T347069
[07:27:41] T347874: Mass account creation request throttling exception (3rd October 2023) - https://phabricator.wikimedia.org/T347874
[07:30:18] ^ I'm fairly sure this will be taking a while
[07:31:33] Np :P I'm only doing an internship in a hospital but there are few people now lol
[07:39:07] (CR) Filippo Giunchedi: [C: +2] thanos: remove thanos components from thanos::frontend role [puppet] - https://gerrit.wikimedia.org/r/956906 (https://phabricator.wikimedia.org/T346143) (owner: Filippo Giunchedi)
[07:40:01] !log taavi@deploy2002 taavi: T347874 and T347069 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:40:07] T347069: Add new autoreview page protection level for Finnish Wikipedia - https://phabricator.wikimedia.org/T347069
[07:40:08] T347874: Mass account creation request throttling exception (3rd October 2023) - https://phabricator.wikimedia.org/T347874
[07:40:15] Superpes: please test
[07:41:56] Looks fine taavi
[07:42:19] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1001.eqiad.wmnet with OS bullseye
[07:42:27] SRE-swift-storage, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was s...
[07:42:39] !log taavi@deploy2002 taavi: Continuing with sync
[07:48:07] (PS1) Majavah: hieradata: cloud: provision wmf-ca-certificates.crt [puppet] - https://gerrit.wikimedia.org/r/962992
[07:49:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:50:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:51:31] Puppet, Cloud-VPS, cloud-services-team: Cloud VPS PuppetDB Postgres instances failing with "could not load root certificate file "/etc/ssl/certs/wmf-ca-certificates.crt": No such file or directory" - https://phabricator.wikimedia.org/T347934 (taavi)
[07:52:12] Puppet, Cloud-VPS, cloud-services-team: Cloud VPS PuppetDB Postgres instances failing with "could not load root certificate file "/etc/ssl/certs/wmf-ca-certificates.crt": No such file or directory" - https://phabricator.wikimedia.org/T347934 (taavi) Seems to be caused by https://gerrit.wikimedia.org/...
[07:52:25] (PS2) Majavah: hieradata: cloud: provision wmf-ca-certificates.crt [puppet] - https://gerrit.wikimedia.org/r/962992 (https://phabricator.wikimedia.org/T347934)
[07:53:26] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43821/console" [puppet] - https://gerrit.wikimedia.org/r/962992 (https://phabricator.wikimedia.org/T347934) (owner: Majavah)
[07:54:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:55:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:55:53] (CR) Elukey: [C: +2] httpbb(liftwing): remove deprecated servers from tests [puppet] - https://gerrit.wikimedia.org/r/962752 (owner: Ilias Sarantopoulos)
[07:56:58] !log taavi@deploy2002 Finished scap: T347874 and T347069 (duration: 29m 22s)
[07:57:03] T347069: Add new autoreview page protection level for Finnish Wikipedia - https://phabricator.wikimedia.org/T347069
[07:57:03] T347874: Mass account creation request throttling exception (3rd October 2023) - https://phabricator.wikimedia.org/T347874
[07:57:10] Superpes: and it's finally live
[07:57:34] Oh wow taavi :O Thanks :3
[07:57:48] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:59:51] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1001.eqiad.wmnet with reason: host reimage
[08:00:08] Just for confirmation... Have you run resetAuthenticationThrottle.php? taavi
[08:00:24] oh, right, I totally forgot
[08:00:30] looking
[08:00:33] (ProbeDown) resolved: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:00:38] Lol https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold
[08:01:41] !log taavi@mwmaint2002 ~ $ mwscript resetAuthenticationThrottle.php --wiki=enwiki --signup --ip=155.232.7.202 # T347874
[08:01:44] now I have
[08:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:57] Thanks again for your time :3
[08:03:04] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1001.eqiad.wmnet with reason: host reimage
[08:03:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[08:07:25] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: fetch-rings-thanos.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:51] expected ^ thanos-fe reimage in progress
[08:08:57] (PS1) Majavah: dynamicproxy: delete backends when deleting route [puppet] - https://gerrit.wikimedia.org/r/962998
[08:09:42] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[08:09:43] (PS2) Majavah: dynamicproxy: delete backends when deleting route [puppet] - https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883)
[08:09:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[08:11:00] (CR) Filippo Giunchedi: [C: +1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[08:12:13] (CR) Filippo Giunchedi: [C: +1] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[08:12:35] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[08:13:02] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[08:14:23] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:15:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:15:39] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[08:17:00] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:17:38] SRE, Observability-Metrics: Grafana dashboard doesn't load anything - https://phabricator.wikimedia.org/T347936 (Fabfur)
[08:17:50] (PS3) Majavah: dynamicproxy: delete backends when deleting route [puppet] - https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883)
[08:18:21] (CR) CI reject: [V: -1] dynamicproxy: 
delete backends when deleting route [puppet] - 10https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883) (owner: 10Majavah) [08:19:26] (03PS1) 10Cathal Mooney: Move 185.15.57.8/29 to netbox-controlled DNS records [dns] - 10https://gerrit.wikimedia.org/r/963000 (https://phabricator.wikimedia.org/T347687) [08:19:32] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [08:19:44] (03CR) 10Volans: [C: 03+1] "LGTM, optional alternative solution inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [08:19:53] (03PS4) 10Majavah: dynamicproxy: delete backends when deleting route [puppet] - 10https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883) [08:20:21] (03PS6) 10Ilias Sarantopoulos: ml-alerts: add alert for increased ORESFetchScoreJob [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [08:20:36] (03PS5) 10Majavah: dynamicproxy: delete backends when deleting route [puppet] - 10https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883) [08:24:11] (03PS1) 10Fabfur: purged: switch to unix socket for varnish requests [puppet] - 10https://gerrit.wikimedia.org/r/963002 (https://phabricator.wikimedia.org/T347837) [08:24:39] (03CR) 10Cathal Mooney: [C: 03+2] Move 185.15.57.8/29 to netbox-controlled DNS records [dns] - 10https://gerrit.wikimedia.org/r/963000 (https://phabricator.wikimedia.org/T347687) (owner: 10Cathal Mooney) [08:25:09] PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: fetch-rings-thanos.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:25] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1001.eqiad.wmnet with OS bullseye [08:26:30] 
10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100... [08:27:40] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ml-staging2001.codfw.wmnet with reason: Check chassis internals for GPU hosting [08:27:54] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ml-staging2001.codfw.wmnet with reason: Check chassis internals for GPU hosting [08:30:23] (03Abandoned) 10Fabfur: purged: switch to unix socket for varnish requests [puppet] - 10https://gerrit.wikimedia.org/r/963002 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [08:30:29] (03CR) 10Filippo Giunchedi: [C: 03+1] "The PCC output seems to suggest this will break in deployment-prep, I'm not sure why though." 
[puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [08:30:38] (03PS1) 10Gerrit maintenance bot: Add fon to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/962231 (https://phabricator.wikimedia.org/T347935) [08:32:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:59] (03PS1) 10Fabfur: purged: use unix socket for varnish in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) [08:34:45] 10SRE, 10Traffic, 10Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [08:35:16] 10SRE, 10Observability-Metrics: Grafana dashboard doesn't load anything: "WebSocket connection failed" - https://phabricator.wikimedia.org/T347936 (10Aklapper) [08:38:52] (03PS2) 10Elukey: Remove ores.svc.{eqiad,codfw}.wmnet records [dns] - 10https://gerrit.wikimedia.org/r/962642 (https://phabricator.wikimedia.org/T347278) [08:40:51] (03CR) 10David Caro: "LGTM, is there an easy way to test this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883) (owner: 10Majavah) [08:40:59] (03CR) 10David Caro: [C: 03+1] dynamicproxy: delete backends when deleting route [puppet] - 10https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883) (owner: 10Majavah) [08:41:09] (03CR) 10Elukey: [C: 03+2] Remove ores.svc.{eqiad,codfw}.wmnet records [dns] - 10https://gerrit.wikimedia.org/r/962642 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [08:44:04] (03CR) 10Majavah: [C: 03+2] dynamicproxy: delete backends when deleting route (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/962998 (https://phabricator.wikimedia.org/T347883) (owner: 10Majavah) [08:45:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:38] (03PS2) 10Ladsgroup: Add fon to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/962231 (https://phabricator.wikimedia.org/T347935) (owner: 10Gerrit maintenance bot) [08:48:43] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Add fon to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/962231 (https://phabricator.wikimedia.org/T347935) (owner: 10Gerrit maintenance bot) [08:50:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:54:44] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/962992 (https://phabricator.wikimedia.org/T347934) (owner: 10Majavah) [08:55:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:55:56] (03PS1) 10Ilias Sarantopoulos: ml-services: return 400 when requesting callbacks on ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/963005 (https://phabricator.wikimedia.org/T347663) [08:58:17] (03CR) 10Elukey: [C: 03+1] ml-services: return 400 when requesting callbacks on ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/963005 (https://phabricator.wikimedia.org/T347663) (owner: 10Ilias Sarantopoulos) [08:58:19] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10dcaro) [09:00:10] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: return 400 when requesting callbacks on ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/963005 (https://phabricator.wikimedia.org/T347663) (owner: 10Ilias Sarantopoulos) [09:01:00] (03Merged) 10jenkins-bot: ml-services: return 400 when requesting callbacks on ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/963005 (https://phabricator.wikimedia.org/T347663) (owner: 10Ilias Sarantopoulos) [09:05:45] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [09:06:13] !log isaranto@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [09:06:39] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[09:07:02] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/962992 (https://phabricator.wikimedia.org/T347934) (owner: 10Majavah) [09:07:18] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1002.eqiad.wmnet with OS bullseye [09:07:24] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi... [09:09:13] (03CR) 10Majavah: [V: 03+1 C: 03+2] hieradata: cloud: provision wmf-ca-certificates.crt [puppet] - 10https://gerrit.wikimedia.org/r/962992 (https://phabricator.wikimedia.org/T347934) (owner: 10Majavah) [09:09:28] (03CR) 10Klausman: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/963007 (owner: 10Klausman) [09:09:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43826/console" [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [09:09:59] (03PS2) 10Klausman: VIPs: add DNS entries for new recommendation-api-ng service [dns] - 10https://gerrit.wikimedia.org/r/963007 [09:13:41] (03CR) 10Elukey: VIPs: add DNS entries for new recommendation-api-ng service (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/963007 (owner: 10Klausman) [09:14:10] 10Puppet, 10Cloud-VPS, 10cloud-services-team: Cloud VPS PuppetDB Postgres instances failing with "could not load root certificate file "/etc/ssl/certs/wmf-ca-certificates.crt": No such file or directory" - https://phabricator.wikimedia.org/T347934 (10taavi) 05Open→03Resolved a:03taavi [09:14:43] thanks taavi [09:15:46] jouncebot: nowandnext [09:15:46] No deployments scheduled for the next 0 hour(s) and 44 minute(s) [09:15:46] In 0 hour(s) and 44 minute(s): MediaWiki 
infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1000) [09:15:49] cool [09:16:04] (03PS3) 10Klausman: VIPs: add DNS entries for new recommendation-api-ng service [dns] - 10https://gerrit.wikimedia.org/r/963007 [09:16:21] (03CR) 10Klausman: VIPs: add DNS entries for new recommendation-api-ng service (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/963007 (owner: 10Klausman) [09:16:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "Ditto as the other patch, as-is will conflict but tested and LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [09:16:58] (03PS7) 10Ilias Sarantopoulos: ml-alerts: add alert for increased ORESFetchScoreJob [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [09:18:12] (03CR) 10CI reject: [V: 04-1] ml-alerts: add alert for increased ORESFetchScoreJob [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [09:19:20] (03PS1) 10Ladsgroup: Init config of Fon Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963010 (https://phabricator.wikimedia.org/T347935) [09:20:56] (03CR) 10Ladsgroup: [C: 03+2] Init config of Fon Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963010 (https://phabricator.wikimedia.org/T347935) (owner: 10Ladsgroup) [09:21:43] (03Merged) 10jenkins-bot: Init config of Fon Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963010 (https://phabricator.wikimedia.org/T347935) (owner: 10Ladsgroup) [09:22:56] (03CR) 10Klausman: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/963009 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [09:23:21] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::pdu_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961865 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [09:23:23] (03CR) 10Elukey: [C: 03+1] conftool-data: Add entry for recommendation-api-ng [puppet] - 10https://gerrit.wikimedia.org/r/963009 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [09:23:33] (03CR) 10Filippo Giunchedi: [C: 03+1] "Will conflict but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [09:23:56] (03CR) 10Klausman: [C: 03+2] conftool-data: Add entry for recommendation-api-ng [puppet] - 10https://gerrit.wikimedia.org/r/963009 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [09:24:01] (03CR) 10Gehel: [C: 03+1] "trivial enough" [deployment-charts] - 10https://gerrit.wikimedia.org/r/962694 (https://phabricator.wikimedia.org/T347515) (owner: 10DCausse) [09:24:37] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1002.eqiad.wmnet with reason: host reimage [09:26:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "Will conflict but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961857 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [09:26:33] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [09:26:44] !log Draining kubernetes2010.codfw.wmnet for reboot to change BIOS setting [09:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [09:27:07] !log filippo@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1002.eqiad.wmnet with reason: host reimage [09:27:36] (03PS1) 10Ladsgroup: Add fonwiki to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963012 (https://phabricator.wikimedia.org/T347935) [09:27:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [09:27:51] (03CR) 10Klausman: ml-alerts: add alert for increased ORESFetchScoreJob (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [09:28:06] (03CR) 10Ladsgroup: [C: 03+2] Add fonwiki to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963012 (https://phabricator.wikimedia.org/T347935) (owner: 10Ladsgroup) [09:28:40] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubernetes2010.codfw.wmnet with reason: BIOS setting change [09:28:54] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubernetes2010.codfw.wmnet with reason: BIOS setting change [09:29:09] (03CR) 10Klausman: ml-alerts: add alert for increased ORESFetchScoreJob (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [09:29:37] (03CR) 10Elukey: ml-alerts: add alert for increased ORESFetchScoreJob (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [09:29:46] (03Merged) 10jenkins-bot: Add fonwiki to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963012 (https://phabricator.wikimedia.org/T347935) 
(owner: 10Ladsgroup) [09:30:28] !log ladsgroup@deploy2002 Started scap: Creating fonwiki (T347935) [09:30:33] T347935: Create Wikipedia Fon - https://phabricator.wikimedia.org/T347935 [09:31:13] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 84.46 ms [09:32:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [09:34:23] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10Ladsgroup) a:03Ladsgroup I think the name must be created under `glam-eu` instead. See https://meta.wikimedia.org/wiki/Mailing_lists/Standardization and on top glam-us and glam... [09:35:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43827/console" [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [09:35:51] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:36:03] That's me ^ [09:36:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [09:37:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:37:19] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:37:57] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:38:03] !log ladsgroup@deploy2002 Finished scap: Creating fonwiki (T347935) (duration: 07m 34s) [09:38:06] T347935: Create Wikipedia Fon - https://phabricator.wikimedia.org/T347935 [09:38:17] (03PS8) 10Ilias Sarantopoulos: ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [09:38:31] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:38:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:39:23] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: bump image version to flink-1.16.1-rdf-0.3.133 [deployment-charts] - 10https://gerrit.wikimedia.org/r/962694 (https://phabricator.wikimedia.org/T347515) (owner: 10DCausse) [09:39:41] (03CR) 10CI reject: [V: 04-1] ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [09:40:08] (03Merged) 10jenkins-bot: rdf-streaming-updater: bump image version to flink-1.16.1-rdf-0.3.133 [deployment-charts] - 10https://gerrit.wikimedia.org/r/962694 (https://phabricator.wikimedia.org/T347515) (owner: 10DCausse) [09:40:18] 10SRE, 10Observability-Metrics: Grafana 
dashboard doesn't load anything: "WebSocket connection failed" - https://phabricator.wikimedia.org/T347936 (10fgiunchedi) I checked the dashboard version history and I believe this was caused by the prometheus `global` deprecation from a while back. The easiest fix is to... [09:41:25] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:42:19] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 173, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:42:23] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [09:42:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:42:53] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [09:42:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:43:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:43:59] (03CR) 10Jbond: puppet: Add new PuppetServer class (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [09:44:04] (03PS10) 10Jbond: puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 [09:45:04] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1002.eqiad.wmnet with OS bullseye [09:45:09] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability 
(FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100... [09:46:39] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (10Clement_Goubert) Hi @Jhancock.wm just a heads up, I rebooted kubernetes2010 to change the CPU power management BIOS setting that was set to BIOS control instead of OS control, which meant we could... [09:49:24] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes2010.codfw.wmnet [09:49:24] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2010.codfw.wmnet [09:49:30] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:49:55] !log Uncordoned kubernetes2010.codfw.wmnet [09:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:10] 10SRE, 10Observability-Metrics: Grafana dashboard doesn't load anything: "WebSocket connection failed" - https://phabricator.wikimedia.org/T347936 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I've fixed the datasource and queries, so the dashboard now loads data again! The labels/legends might need so... 
[09:54:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43829/console" [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [09:56:09] (03CR) 10Jbond: [C: 03+2] puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [09:56:27] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10Ladsgroup) If that name is okay with you, let me know and I create the mailing list. [09:59:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43830/console" [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1000) [10:00:10] (03Merged) 10jenkins-bot: puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [10:01:20] (03CR) 10Jbond: [C: 03+2] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:01:24] (03CR) 10Jbond: [C: 03+2] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:01:28] (03CR) 10Jbond: [C: 03+2] prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:01:31] (03CR) 10Jbond: [C: 03+2] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/960126 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) 
[10:04:31] (03PS14) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [10:06:00] (03PS8) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) [10:06:38] (03CR) 10CI reject: [V: 04-1] prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:09:18] (03PS10) 10Jbond: prometheus::jmx_exporter_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) [10:09:55] (03CR) 10CI reject: [V: 04-1] prometheus::jmx_exporter_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:10:54] (03PS15) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [10:11:55] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43833/console" [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [10:13:36] (03PS3) 10Majavah: P:cloudceph: cleanup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/961783 [10:14:56] 10SRE, 10Observability-Metrics: Grafana dashboard doesn't load anything: "WebSocket connection failed" - https://phabricator.wikimedia.org/T347936 (10Fabfur) Thank you very much! 
[10:15:21] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1003.eqiad.wmnet with OS bullseye [10:15:26] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi... [10:19:30] !log vgutierrez@cumin1001 START - Cookbook sre.dns.netbox [10:20:34] (03PS11) 10Jbond: prometheus::jmx_exporter_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) [10:22:32] (03PS1) 10Jbond: Revert "get_clusters: remove legacy functions" [puppet] - 10https://gerrit.wikimedia.org/r/962221 [10:22:36] (03PS1) 10Jbond: Revert "prometheus: switch to wmflib::get_clusters" [puppet] - 10https://gerrit.wikimedia.org/r/962222 [10:22:39] (03PS1) 10Jbond: Revert "P:prometheus::ops: convert to using wmflib::get_clusters" [puppet] - 10https://gerrit.wikimedia.org/r/962223 [10:22:42] (03PS1) 10Jbond: Revert "wmflib::get_clusters: create a puppet version of get_clu..." 
[puppet] - 10https://gerrit.wikimedia.org/r/962224 [10:23:35] (03PS2) 10Stevemunene: druid: Bring druid1010.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/962248 (https://phabricator.wikimedia.org/T336042) [10:23:37] (03PS2) 10Stevemunene: druid: Bring druid1011.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/962249 (https://phabricator.wikimedia.org/T336042) [10:23:41] (03PS2) 10Stevemunene: druid: Add druid druid10[09-11] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/962250 (https://phabricator.wikimedia.org/T336042) [10:26:06] (03CR) 10CI reject: [V: 04-1] Revert "get_clusters: remove legacy functions" [puppet] - 10https://gerrit.wikimedia.org/r/962221 (owner: 10Jbond) [10:30:57] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:32:23] !log vgutierrez@cumin1001 START - Cookbook sre.dns.netbox [10:32:36] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: new_rings.tar.bz2 not found after host reimage - https://phabricator.wikimedia.org/T347964 (10fgiunchedi) [10:32:43] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1003.eqiad.wmnet with reason: host reimage [10:34:51] !log vgutierrez@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix katran-test.svc.eqiad.wmnet IP allocation - vgutierrez@cumin1001" [10:35:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [10:35:59] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
thanos-fe1003.eqiad.wmnet with reason: host reimage [10:36:03] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix katran-test.svc.eqiad.wmnet IP allocation - vgutierrez@cumin1001" [10:36:03] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:36:21] (03PS1) 10Majavah: P:toolforge: install toolforge-builds-cli [puppet] - 10https://gerrit.wikimedia.org/r/963018 (https://phabricator.wikimedia.org/T334588) [10:36:52] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "get_clusters: remove legacy functions" [puppet] - 10https://gerrit.wikimedia.org/r/962221 (owner: 10Jbond) [10:36:56] (03CR) 10Jbond: [C: 03+2] Revert "prometheus: switch to wmflib::get_clusters" [puppet] - 10https://gerrit.wikimedia.org/r/962222 (owner: 10Jbond) [10:37:00] (03CR) 10Jbond: [C: 03+2] Revert "P:prometheus::ops: convert to using wmflib::get_clusters" [puppet] - 10https://gerrit.wikimedia.org/r/962223 (owner: 10Jbond) [10:37:04] (03CR) 10Jbond: [C: 03+2] Revert "wmflib::get_clusters: create a puppet version of get_clu..." 
[puppet] - 10https://gerrit.wikimedia.org/r/962224 (owner: 10Jbond) [10:37:59] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:38:40] (03CR) 10David Caro: [C: 03+1] P:toolforge: install toolforge-builds-cli [puppet] - 10https://gerrit.wikimedia.org/r/963018 (https://phabricator.wikimedia.org/T334588) (owner: 10Majavah) [10:38:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.254 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:40:07] (03CR) 10Majavah: [C: 03+2] P:toolforge: install toolforge-builds-cli [puppet] - 10https://gerrit.wikimedia.org/r/963018 (https://phabricator.wikimedia.org/T334588) (owner: 10Majavah) [10:40:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [10:40:50] (03PS2) 10Fabfur: purged: use unix socket for varnish in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) [10:44:03] (03PS1) 10Jbond: Revert "Revert "wmflib::get_clusters: create a puppet version of..." [puppet] - 10https://gerrit.wikimedia.org/r/962225 [10:44:09] (03PS1) 10Jbond: Revert "Revert "P:prometheus::ops: convert to using wmflib::get_..." 
[puppet] - 10https://gerrit.wikimedia.org/r/963026 [10:44:14] (03PS1) 10Jbond: Revert^2 "prometheus: switch to wmflib::get_clusters" [puppet] - 10https://gerrit.wikimedia.org/r/963027 [10:44:17] (03PS1) 10Jbond: Revert^2 "get_clusters: remove legacy functions" [puppet] - 10https://gerrit.wikimedia.org/r/963028 [10:44:53] (03PS2) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) [10:45:13] (03PS2) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) [10:45:44] (03PS2) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) [10:46:07] (03PS2) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) [10:47:15] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43837/console" [puppet] - 10https://gerrit.wikimedia.org/r/953200 (owner: 10Clément Goubert) [10:51:23] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43839/console" [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [10:54:00] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1003.eqiad.wmnet with OS bullseye [10:54:05] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100... 
[10:55:14] (03PS4) 10Clément Goubert: P:mw::deployment::server: Don't alert for train-presync [puppet] - 10https://gerrit.wikimedia.org/r/953200 [10:57:39] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43840/console" [puppet] - 10https://gerrit.wikimedia.org/r/953200 (owner: 10Clément Goubert) [11:00:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:04:29] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:06:01] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:28] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: new_rings.tar.bz2 not found after host reimage - https://phabricator.wikimedia.org/T347964 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Followup from IRC, this is expected when reimaging the ring... 
[11:08:32] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10fgiunchedi) [11:08:47] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:09:56] (03CR) 10Clément Goubert: [V: 03+1] P:mw::deployment::server: Don't alert for train-presync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953200 (owner: 10Clément Goubert) [11:11:43] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [11:11:48] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi... 
[11:12:31] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:13:49] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:22:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificates) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:24:28] (03PS3) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) [11:24:30] (03PS3) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) [11:24:32] (03PS3) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) [11:24:34] (03PS3) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) [11:25:16] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:26:42] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:27:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificates) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:29:02] !log aborrero@cumin1001 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack - aborrero@cumin1001" [11:29:50] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1004.eqiad.wmnet with reason: host reimage [11:29:52] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack - aborrero@cumin1001" [11:29:52] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:30:53] (03PS4) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) [11:30:55] (03PS4) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) [11:30:57] (03PS4) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) [11:30:59] (03PS4) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) [11:31:03] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:31:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [11:31:44] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 
(https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:32:24] (03CR) 10CI reject: [V: 04-1] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:32:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:33:05] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1004.eqiad.wmnet with reason: host reimage [11:35:20] jouncebot: nowandnext [11:35:21] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [11:35:21] In 0 hour(s) and 24 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1200) [11:37:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:38:15] (03PS5) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) [11:39:08] (03CR) 10CI reject: [V: 04-1] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:44:30] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:44:54] (03CR) 10Vgutierrez: [C: 03+1] purged: use unix socket for varnish in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [11:45:16] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10dcaro) [11:48:28] (03PS2) 10KartikMistry: Update cxserver to 2023-09-28-043003-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/961979 (https://phabricator.wikimedia.org/T343450) [11:51:12] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1004.eqiad.wmnet with OS bullseye [11:51:17] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100... 
[11:51:30] (03PS6) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) [11:52:16] (03CR) 10Fabfur: [V: 03+1 C: 03+2] purged: use unix socket for varnish in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/963004 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [11:54:15] (03CR) 10CI reject: [V: 04-1] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:54:21] (03PS1) 10FNegri: Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 10https://gerrit.wikimedia.org/r/963029 [11:54:34] !log merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/963004 (T347837). `purged` daemon will be restarted by puppet in ulsfo in the next 30m [11:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:37] T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 [11:54:42] (03PS2) 10FNegri: Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 10https://gerrit.wikimedia.org/r/963029 [11:54:46] (03CR) 10CI reject: [V: 04-1] Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 10https://gerrit.wikimedia.org/r/963029 (owner: 10FNegri) [11:55:06] 10SRE, 10Traffic, 10Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [11:55:12] (03CR) 10CI reject: [V: 04-1] Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 10https://gerrit.wikimedia.org/r/963029 (owner: 10FNegri) [11:56:51] (03PS3) 10FNegri: Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 10https://gerrit.wikimedia.org/r/963029 [11:57:46] (03PS4) 10FNegri: Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)"" [puppet] - 
10https://gerrit.wikimedia.org/r/963029 (https://phabricator.wikimedia.org/T341285) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1200) [12:00:56] (03PS2) 10FNegri: Revert "bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring" [puppet] - 10https://gerrit.wikimedia.org/r/962207 (https://phabricator.wikimedia.org/T339894) [12:03:12] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (10ayounsi) [12:03:40] (03PS1) 10Fabfur: purged: use unix socket for varnish in codfw [puppet] - 10https://gerrit.wikimedia.org/r/963020 (https://phabricator.wikimedia.org/T347837) [12:05:42] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) Asked Juniper about their timeline on getting this setup. [12:06:26] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10netops, and 2 others: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10JAllemandou) [12:06:48] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43845/console" [puppet] - 10https://gerrit.wikimedia.org/r/963020 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [12:16:18] (03PS7) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) [12:16:34] 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10JAllemandou) Hi SRE folks, We'd need @SGupta-WMF to be a member of the analytics-admin group so that she can handle ops-week tasks such as deployment and other restarts. 
Many thanks [12:17:20] (03PS1) 10Btullis: Add Surbhi Gupta to the analytics-admins group [puppet] - 10https://gerrit.wikimedia.org/r/963022 (https://phabricator.wikimedia.org/T335657) [12:19:09] (03CR) 10CI reject: [V: 04-1] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [12:19:40] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10BTullis) Thanks @JAllemandou - I'm currently listed as an approver for this group, and I'm happy to approve the request :-) [12:23:16] (03CR) 10Joal: [C: 03+1] "Thanks Ben :)" [puppet] - 10https://gerrit.wikimedia.org/r/963022 (https://phabricator.wikimedia.org/T335657) (owner: 10Btullis) [12:23:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:23:59] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2001.codfw.wmnet with OS bullseye [12:24:05] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi... 
[12:24:30] (03PS2) 10Sharvaniharan: New donor experience stream for apps event schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960124 [12:26:03] (03CR) 10Brouberol: [C: 03+2] Add Surbhi Gupta to the analytics-admins group [puppet] - 10https://gerrit.wikimedia.org/r/963022 (https://phabricator.wikimedia.org/T335657) (owner: 10Btullis) [12:28:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:34:46] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10BTullis) I have also added `sg912` to the LDAP group `wmf` as set out in T335657#9186606 [12:34:52] (03CR) 10Btullis: [C: 03+2] Add Surbhi Gupta to the analytics-admins group [puppet] - 10https://gerrit.wikimedia.org/r/963022 (https://phabricator.wikimedia.org/T335657) (owner: 10Btullis) [12:35:20] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netbox, 10netops: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (10ayounsi) I think here the only/best option is to reduce the time delta between when a server is connected and when switch port is configured (line `Run the sr... 
[12:36:59] (03PS5) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) [12:37:01] (03PS5) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) [12:37:03] (03PS5) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) [12:37:05] (03PS8) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) [12:37:51] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [12:38:40] (03PS2) 10Anzx: arwiki: add importsources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962127 (https://phabricator.wikimedia.org/T347563) [12:39:15] Dreamy_Jazz: FYI, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/962623/ (scheduled for next window) has already been deployed, that expected? [12:40:00] Yeah. 
I was able to get it deployed before that window but didn't have a chance to remove it yet [12:40:17] (03PS3) 10Anzx: arwiki: add importsources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962127 (https://phabricator.wikimedia.org/T347563) [12:40:51] (03CR) 10CI reject: [V: 04-1] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [12:40:53] ack :-) [12:41:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T343198)', diff saved to https://phabricator.wikimedia.org/P52803 and previous config saved to /var/cache/conftool/dbconfig/20231003-124141-arnaudb.json [12:41:45] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [12:42:59] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: host reimage [12:45:10] (03PS6) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) [12:45:12] (03PS6) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) [12:45:14] (03PS6) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) [12:45:16] (03PS9) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) [12:45:26] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: host reimage [12:45:53] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 
(https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [12:50:16] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1025.eqiad.wmnet with OS bullseye [12:50:26] 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1025.eqiad.wmnet with OS bullseye [12:50:40] (03PS1) 10Filippo Giunchedi: services: fix xenon/arclamp redis egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/963024 (https://phabricator.wikimedia.org/T347926) [12:53:34] (03PS1) 10Anzx: add throttle rules for Ada Lovelace Day October 10, 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963025 (https://phabricator.wikimedia.org/T347719) [12:56:44] (03PS2) 10Anzx: add throttle rules for Ada Lovelace Day October 10, 2023 and fix throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963025 (https://phabricator.wikimedia.org/T347719) [12:56:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P52804 and previous config saved to /var/cache/conftool/dbconfig/20231003-125647-arnaudb.json [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1300) [13:00:06] sharvani__ and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] * TheresNoTime can deploy [13:00:28] Hi... 
here for deployment of my patch :-) [13:00:38] sharvani__: o/ will start with yours :-) [13:00:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960124 (owner: 10Sharvaniharan) [13:00:48] (03PS9) 10Ilias Sarantopoulos: ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [13:00:50] o/ [13:00:51] Thank you! [13:01:15] aanzx: quick note, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/962127/ is still marked WIP [13:01:36] (03Merged) 10jenkins-bot: New donor experience stream for apps event schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960124 (owner: 10Sharvaniharan) [13:01:52] !log samtar@deploy2002 Started scap: Backport for [[gerrit:960124|New donor experience stream for apps event schema]] [13:01:57] (03PS4) 10Samtar: arwiki: add importsources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962127 (https://phabricator.wikimedia.org/T347563) (owner: 10Anzx) [13:01:58] TheresNoTime: marked as active now [13:02:05] :-) [13:02:11] (03CR) 10CI reject: [V: 04-1] ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:03:08] !log samtar@deploy2002 sharvaniharan and samtar: Backport for [[gerrit:960124|New donor experience stream for apps event schema]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:03:10] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe2001.codfw.wmnet with OS bullseye [13:03:14] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started 
by filippo@cumin100... [13:03:21] sharvani__: can you test this change on mwdebug? [13:03:30] yes... testing now.. [13:03:47] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1025.eqiad.wmnet with reason: host reimage [13:03:56] Working perfectly! Ty! :) [13:04:04] !log samtar@deploy2002 sharvaniharan and samtar: Continuing with sync [13:04:50] (03PS1) 10Filippo Giunchedi: wmflib: clarify 'params' service::probe parameter [puppet] - 10https://gerrit.wikimedia.org/r/963049 [13:05:08] 10SRE, 10ops-esams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ayounsi) [13:05:13] 10ops-esams, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic, 10netops: Q4/Q1:knams racking elevations & planning - https://phabricator.wikimedia.org/T331886 (10ayounsi) 05Open→03Resolved a:03RobH I believe this is all done. [13:07:00] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1025.eqiad.wmnet with reason: host reimage [13:07:54] (unrelated to current deploys) There's a lot of `PHP Warning: RedisException: Connection timed out` -spam in logstash, assuming nothing serious but logged just in case at T347987 [13:07:55] T347987: PHP Warning: RedisException: Connection timed out - https://phabricator.wikimedia.org/T347987 [13:08:38] TheresNoTime: I bet that's T347916 and I'm working to fix it in T347926 [13:08:38] T347926: Excimer UI profile lost when requested from mw-on-k8s - https://phabricator.wikimedia.org/T347926 [13:08:39] T347916: Investigate sharp increase in lost Arc Lamp samples (arclamp_client_error.exception) - https://phabricator.wikimedia.org/T347916 [13:09:04] PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:09:19] godog: ack, good luck :-) (feel free to 
close/merge the task if needed) [13:09:48] PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:10:18] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:960124|New donor experience stream for apps event schema]] (duration: 08m 26s) [13:10:22] sharvani__: live on prod :) [13:10:42] (03PS2) 10Filippo Giunchedi: services: fix xenon/arclamp redis egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/963024 (https://phabricator.wikimedia.org/T347926) [13:10:43] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt-wdqs1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Majavah https://phabricator.wikimedia.org/T346948 - The acknowledgement expires at: 2023-11-04 13:10:27. https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:10:43] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Majavah https://phabricator.wikimedia.org/T346948 - The acknowledgement expires at: 2023-11-04 13:10:27. https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:10:48] Thank you for deploying @TheresNoTime :-) [13:10:49] aanzx: I'm going to do your 962127 and 963025 together if that's okay [13:10:55] (03PS3) 10Samtar: add throttle rules for Ada Lovelace Day October 10, 2023 and fix throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963025 (https://phabricator.wikimedia.org/T347719) (owner: 10Anzx) [13:10:59] ok, np [13:11:08] sharvani__: you're welcome! 
[13:11:13] TheresNoTime: ack, yeah will link the related tasks and close it, thank you [13:11:51] (03PS1) 10Sg912: Update mediawiki_history snapshot [puppet] - 10https://gerrit.wikimedia.org/r/963050 [13:11:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P52805 and previous config saved to /var/cache/conftool/dbconfig/20231003-131154-arnaudb.json [13:12:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962127 (https://phabricator.wikimedia.org/T347563) (owner: 10Anzx) [13:12:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963025 (https://phabricator.wikimedia.org/T347719) (owner: 10Anzx) [13:12:26] (03PS2) 10Sg912: Updated mediawiki_history snapshot as part of Ops week [puppet] - 10https://gerrit.wikimedia.org/r/963050 [13:12:56] !log taavi@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt-wdqs1002.eqiad.wmnet [13:13:10] (03Merged) 10jenkins-bot: arwiki: add importsources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962127 (https://phabricator.wikimedia.org/T347563) (owner: 10Anzx) [13:13:13] (03CR) 10Filippo Giunchedi: [C: 03+2] wmflib: clarify 'params' service::probe parameter [puppet] - 10https://gerrit.wikimedia.org/r/963049 (owner: 10Filippo Giunchedi) [13:13:15] (03Merged) 10jenkins-bot: add throttle rules for Ada Lovelace Day October 10, 2023 and fix throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963025 (https://phabricator.wikimedia.org/T347719) (owner: 10Anzx) [13:13:26] !log samtar@deploy2002 Started scap: Backport for [[gerrit:962127|arwiki: add importsources (T347563)]], [[gerrit:963025|add throttle rules for Ada Lovelace Day October 10, 2023 and fix throttle rule for UIUC Wikipedia 
edit-a-thon October 13, 2023 (T347719)]] [13:13:31] (03CR) 10Elukey: [C: 03+1] VIPs: add DNS entries for new recommendation-api-ng service [dns] - 10https://gerrit.wikimedia.org/r/963007 (owner: 10Klausman) [13:13:38] T347563: Add English Wikipedia to import sources of Arabic Wikipedia - https://phabricator.wikimedia.org/T347563 [13:13:38] T347719: Lift IP caps for Ada Lovelace Day (Oct10, 2023) - https://phabricator.wikimedia.org/T347719 [13:14:45] !log samtar@deploy2002 anzx and samtar: Backport for [[gerrit:962127|arwiki: add importsources (T347563)]], [[gerrit:963025|add throttle rules for Ada Lovelace Day October 10, 2023 and fix throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 (T347719)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:14:48] (03CR) 10Joal: [C: 03+1] "Thanks Surbhi" [puppet] - 10https://gerrit.wikimedia.org/r/963050 (owner: 10Sg912) [13:14:49] TheresNoTime: testing [13:14:51] aanzx: ready for testing on mwdebug [13:14:52] ack [13:16:08] TheresNoTime: looks good [13:16:15] syncing [13:16:17] !log samtar@deploy2002 anzx and samtar: Continuing with sync [13:17:14] (03PS10) 10Ilias Sarantopoulos: ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [13:17:18] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T347919 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [13:18:27] (03CR) 10CI reject: [V: 04-1] ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:18:53] (03CR) 10Btullis: [C: 03+2] Updated mediawiki_history snapshot as part of Ops week [puppet] - 10https://gerrit.wikimedia.org/r/963050 (owner: 10Sg912) [13:19:31] !log taavi@cumin1001 START - Cookbook sre.dns.netbox [13:19:35] (03PS1) 10Ottomata: mw-page-content-change-enrich - bump to 1.27.0, and set 
codfw replicas to 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/963051 (https://phabricator.wikimedia.org/T347676) [13:20:38] (03PS4) 10Klausman: VIPs: add DNS entries for new recommendation-api-ng service [dns] - 10https://gerrit.wikimedia.org/r/963007 [13:21:15] (03CR) 10Klausman: [V: 03+2 C: 03+2] VIPs: add DNS entries for new recommendation-api-ng service [dns] - 10https://gerrit.wikimedia.org/r/963007 (owner: 10Klausman) [13:21:45] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt-wdqs1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - taavi@cumin1001" [13:22:30] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:962127|arwiki: add importsources (T347563)]], [[gerrit:963025|add throttle rules for Ada Lovelace Day October 10, 2023 and fix throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 (T347719)]] (duration: 09m 03s) [13:22:33] aanzx: live in prod :) [13:22:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:22:36] TheresNoTime: thank you [13:22:42] T347563: Add English Wikipedia to import sources of Arabic Wikipedia - https://phabricator.wikimedia.org/T347563 [13:22:42] T347719: Lift IP caps for Ada Lovelace Day (Oct10, 2023) - https://phabricator.wikimedia.org/T347719 [13:22:54] * TheresNoTime will be around for another ~15m if there's any other patches [13:23:20] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt-wdqs1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - taavi@cumin1001" [13:23:20] !log taavi@cumin1001 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [13:23:20] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt-wdqs1002.eqiad.wmnet [13:23:39] !log taavi@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt-wdqs1003.eqiad.wmnet [13:23:59] (03PS1) 10MVernon: aptrepo: install zip on aptrepo servers [puppet] - 10https://gerrit.wikimedia.org/r/963052 (https://phabricator.wikimedia.org/T304491) [13:24:51] (03PS7) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) [13:24:53] (03PS7) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) [13:24:55] (03PS7) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) [13:24:57] (03PS10) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) [13:25:48] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [13:26:27] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/963052 (https://phabricator.wikimedia.org/T304491) (owner: 10MVernon) [13:26:42] (03CR) 10Klausman: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/963013 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [13:27:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T343198)', diff saved to https://phabricator.wikimedia.org/P52806 and previous config saved to /var/cache/conftool/dbconfig/20231003-132700-arnaudb.json [13:27:02] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - bump to 1.27.0, and set codfw replicas to 12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/963051 (https://phabricator.wikimedia.org/T347676) (owner: 10Ottomata) [13:27:03] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [13:27:06] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [13:27:27] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [13:27:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T343198)', diff saved to https://phabricator.wikimedia.org/P52807 and previous config saved to /var/cache/conftool/dbconfig/20231003-132733-arnaudb.json [13:27:35] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:27:37] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe2002.codfw.wmnet with OS bullseye [13:27:45] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumi... 
[13:30:26] !log taavi@cumin1001 START - Cookbook sre.dns.netbox [13:30:28] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [13:30:41] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:31:07] (03PS8) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) [13:31:09] (03PS8) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) [13:31:11] (03PS8) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) [13:31:13] (03PS11) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) [13:32:19] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [13:32:47] (03CR) 10CI reject: [V: 04-1] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [13:33:01] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt-wdqs1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - taavi@cumin1001" [13:33:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:57] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1026.eqiad.wmnet 
[13:34:02] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1026.eqiad.wmnet [13:34:10] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt-wdqs1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - taavi@cumin1001" [13:34:10] !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:34:11] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt-wdqs1003.eqiad.wmnet [13:34:20] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1025.eqiad.wmnet with OS bullseye [13:34:25] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [13:34:31] 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1025.eqiad.wmnet with OS bullseye completed: - restbase1025 (... 
[13:34:37] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:35:28] (03CR) 10Jbond: [C: 03+2] augeas_core: update augeas_core [puppet] - 10https://gerrit.wikimedia.org/r/962618 (owner: 10Jbond) [13:36:53] 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Ottomata) [13:37:05] (03PS1) 10Klausman: Revert "conftool-data: Add entry for recommendation-api-ng" [puppet] - 10https://gerrit.wikimedia.org/r/963033 [13:37:17] 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (10Ottomata) Updated description and tagged #sre-access-requests [13:37:30] (03PS1) 10DCausse: rdf-streaming-updater: take a savepoint manually [deployment-charts] - 10https://gerrit.wikimedia.org/r/963057 (https://phabricator.wikimedia.org/T342149) [13:37:36] (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961897 (owner: 10Majavah) [13:37:39] (03PS9) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) [13:37:41] (03PS9) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) [13:37:43] (03PS9) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) [13:37:45] (03PS12) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) [13:38:16] !log mw-page-content-change-enrich codfw - bump to 1.27.0 and set replicas to 12 
while processing backlog - T347676 [13:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:20] T347676: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 [13:38:21] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [13:38:22] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [13:38:31] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:39:02] (03PS1) 10Klausman: Revert "VIPs: add DNS entries for new recommendation-api-ng service" [dns] - 10https://gerrit.wikimedia.org/r/963034 [13:39:16] (03CR) 10Brennen Bearnes: [C: 03+2] AVA: Make score.php not fail with Fatal Error after libphutil removal [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/933907 (https://phabricator.wikimedia.org/T340633) (owner: 10Aklapper) [13:39:43] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] AVA: Make score.php not fail with Fatal Error after libphutil removal [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/933907 (https://phabricator.wikimedia.org/T340633) (owner: 10Aklapper) [13:39:58] (03CR) 10Elukey: [C: 03+1] Revert "conftool-data: Add entry for recommendation-api-ng" [puppet] - 10https://gerrit.wikimedia.org/r/963033 (owner: 10Klausman) [13:40:13] (03CR) 10Elukey: [C: 03+1] Revert "VIPs: add DNS entries for new recommendation-api-ng service" [dns] - 10https://gerrit.wikimedia.org/r/963034 (owner: 10Klausman) [13:40:22] (03CR) 10Klausman: [C: 03+2] Revert "conftool-data: Add entry for recommendation-api-ng" [puppet] - 10https://gerrit.wikimedia.org/r/963033 (owner: 10Klausman) [13:40:30] (03CR) 10Klausman: [C: 03+2] Revert "VIPs: add DNS entries for new 
recommendation-api-ng service" [dns] - 10https://gerrit.wikimedia.org/r/963034 (owner: 10Klausman) [13:41:05] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: take a savepoint manually [deployment-charts] - 10https://gerrit.wikimedia.org/r/963057 (https://phabricator.wikimedia.org/T342149) (owner: 10DCausse) [13:41:39] !log taavi@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt-wdqs1001.eqiad.wmnet [13:41:53] (03Merged) 10jenkins-bot: rdf-streaming-updater: take a savepoint manually [deployment-charts] - 10https://gerrit.wikimedia.org/r/963057 (https://phabricator.wikimedia.org/T342149) (owner: 10DCausse) [13:42:01] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1025.eqiad.wmnet [13:42:02] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1025.eqiad.wmnet [13:42:33] 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [13:43:28] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:43:30] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye [13:43:35] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:43:40] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1033.eqiad.wmnet with OS bullseye [13:43:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [13:43:56] 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1033.eqiad.wmnet with OS bullseye [13:44:09] (03CR) 10Fabfur: purged: use unix socket for varnish in codfw [puppet] - 10https://gerrit.wikimedia.org/r/963020 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [13:44:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43851/console" [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [13:44:22] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [13:44:45] btullis: thanks so much for the unlocking for Surbhi - she still has issues ssh-ing deployment, but that'll be for tomorrow :) [13:44:46] (03CR) 10Clément Goubert: [C: 03+1] services: fix xenon/arclamp redis egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/963024 (https://phabricator.wikimedia.org/T347926) (owner: 10Filippo Giunchedi) [13:46:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:34] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2002.codfw.wmnet with reason: host reimage [13:46:34] 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [13:46:48] (03PS10) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) [13:46:50] (03PS10) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) [13:46:52] (03PS10) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 
(https://phabricator.wikimedia.org/T341373) [13:46:54] (03PS13) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) [13:47:30] (03CR) 10CI reject: [V: 04-1] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [13:47:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:48:08] !log taavi@cumin1001 START - Cookbook sre.dns.netbox [13:48:13] !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Revert allocation of LVS VIPs for recommendation-api-ng - klausman@cumin1001" [13:48:56] (03CR) 10Vgutierrez: [C: 03+1] purged: use unix socket for varnish in codfw [puppet] - 10https://gerrit.wikimedia.org/r/963020 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [13:49:11] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2002.codfw.wmnet with reason: host reimage [13:49:40] !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:49:41] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt-wdqs1001.eqiad.wmnet [13:49:46] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-master1004 [13:50:36] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Revert allocation of LVS VIPs for 
recommendation-api-ng - klausman@cumin1001" [13:50:37] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:50:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-master1004 [13:51:03] (03CR) 10Klausman: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/963059 (owner: 10Klausman) [13:51:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:51:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43852/console" [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [13:52:32] (03CR) 10Fabfur: [C: 03+2] purged: use unix socket for varnish in codfw [puppet] - 10https://gerrit.wikimedia.org/r/963020 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [13:52:46] (03CR) 10Ottomata: k8s config: Provide kafka and zookeeper hostnames (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/960662 (owner: 10Ebernhardson) [13:52:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:54:51] (03PS1) 10Jbond: P:dumps::distribution::ferm: pass array directly do ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/963062 (https://phabricator.wikimedia.org/T346165) 
[13:56:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43853/console" [puppet] - 10https://gerrit.wikimedia.org/r/963062 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond) [13:57:08] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1033.eqiad.wmnet with reason: host reimage [13:57:48] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-master1003 [13:58:09] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T348001 (10SCampos-WMF) [13:58:14] (03PS11) 10Jbond: wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) [13:58:17] (03PS11) 10Jbond: P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) [13:58:19] (03PS11) 10Jbond: prometheus: switch to wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) [13:58:21] (03PS14) 10Jbond: get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) [13:58:57] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-master1003 [13:59:36] PROBLEM - WDQS SPARQL on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:59:39] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye [13:59:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 
for host an-master1003.eqiad.wmnet with OS bullseye [14:01:19] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T348001 (10RhinosF1) @SCampos-WMF: Can you please link your wikitech account to your phabricator account? I suspect 'wmf' will be the correct group for you. [14:01:36] 10SRE, 10LDAP-Access-Requests: Grant Access to for Sara Campos - https://phabricator.wikimedia.org/T348001 (10RhinosF1) [14:01:37] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1033.eqiad.wmnet with reason: host reimage [14:01:51] !log merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/963020 (T347837). `purged` daemon will be restarted by puppet in codfw in the next 30m [14:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:04] T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 [14:02:12] jouncebot: now and next [14:02:13] No deployments scheduled for the next 1 hour(s) and 57 minute(s) [14:02:20] (03PS5) 10Herron: thanos-rule: add pyrra filesystem operator output dir to search path [puppet] - 10https://gerrit.wikimedia.org/r/930628 (https://phabricator.wikimedia.org/T302995) [14:03:24] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10RhinosF1) [14:03:30] PROBLEM - Query Service HTTP Port on wdqs1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 364 bytes in 9.346 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:04:14] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10RhinosF1) I've updated the description to wmf for you @SCampos-WMF as I see you have an @wikimedia.org email and that access allows matomo. 
ldap pulled with https://ldap.toolforge.org/user/scampos [14:04:22] RECOVERY - Query Service HTTP Port on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:04:24] 10SRE, 10Traffic, 10Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [14:04:32] RECOVERY - WDQS SPARQL on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:05:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 9 NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43854/console" [puppet] - 10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [14:06:20] (03CR) 10Ottomata: Pull some flink config down into the chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) (owner: 10Ebernhardson) [14:07:09] (03CR) 10Herron: [C: 03+1] profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [14:07:54] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe2002.codfw.wmnet with OS bullseye [14:07:59] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Remove thanos components from thanos-fe role and reimage thanos-fe hosts - https://phabricator.wikimedia.org/T346143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin100... 
[14:08:30] (03CR) 10Herron: [C: 03+1] profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [14:09:20] (03CR) 10Herron: [C: 03+1] syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [14:11:41] (03CR) 10Herron: [C: 03+1] profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [14:16:29] 10SRE, 10Traffic: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [14:18:10] (03PS1) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.4 [software/homer] - 10https://gerrit.wikimedia.org/r/963069 [14:19:48] (03CR) 10Volans: "couple of nits, lgtm otherwise" [software/homer] - 10https://gerrit.wikimedia.org/r/963069 (owner: 10Ayounsi) [14:19:56] (03CR) 10CI reject: [V: 04-1] CHANGELOG: add changelogs for release v0.6.4 [software/homer] - 10https://gerrit.wikimedia.org/r/963069 (owner: 10Ayounsi) [14:21:02] (03PS2) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.4 [software/homer] - 10https://gerrit.wikimedia.org/r/963069 [14:21:06] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:24] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:34] (03CR) 10Ayounsi: [C: 03+2] CHANGELOG: add changelogs for release v0.6.4 (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/963069 (owner: 10Ayounsi) [14:23:39] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/963069 (owner: 10Ayounsi) [14:25:28] (03Merged) 10jenkins-bot: 
CHANGELOG: add changelogs for release v0.6.4 [software/homer] - 10https://gerrit.wikimedia.org/r/963069 (owner: 10Ayounsi) [14:26:16] PROBLEM - WDQS SPARQL on wdqs2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:31:20] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1033.eqiad.wmnet with OS bullseye [14:31:31] 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1033.eqiad.wmnet with OS bullseye completed: - restbase1033 (... [14:33:24] (03CR) 10Btullis: [C: 03+1] P:dumps::distribution::ferm: pass array directly do ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/963062 (https://phabricator.wikimedia.org/T346165) (owner: 10Jbond) [14:33:44] (03PS1) 10Ayounsi: Release v0.6.4 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/963079 [14:34:10] (03CR) 10Filippo Giunchedi: [C: 03+2] services: fix xenon/arclamp redis egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/963024 (https://phabricator.wikimedia.org/T347926) (owner: 10Filippo Giunchedi) [14:35:11] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/963079 (owner: 10Ayounsi) [14:35:22] !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:35:37] !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:35:38] !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:36:03] 10SRE, 10Traffic: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [14:36:06] !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:36:07] !log filippo@deploy2002 helmfile [codfw] START 
helmfile.d/services/mw-web: apply [14:36:24] RECOVERY - WDQS SPARQL on wdqs2022 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 0.366 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:36:24] !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [14:36:25] !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:36:42] !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:36:44] !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [14:36:44] (03PS1) 10Ottomata: mw-page-content-change-enrich - set replicas: 2 for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/963080 (https://phabricator.wikimedia.org/T266798) [14:36:54] !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [14:36:55] !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:37:12] !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:37:14] !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [14:37:23] !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [14:37:24] !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [14:37:27] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003'] [14:37:36] !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [14:37:37] !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [14:37:41] ye olde wall of SAL [14:37:47] !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [14:37:48] !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [14:37:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) 
upgrade firmware for hosts ['an-master1003'] [14:38:00] !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [14:38:02] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003'] [14:38:08] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1003'] [14:38:14] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003'] [14:38:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1003'] [14:38:26] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004'] [14:38:42] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004'] [14:38:47] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:07] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004'] [14:39:09] (03PS1) 10Fabfur: purged: use unix socket for varnish in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/963081 (https://phabricator.wikimedia.org/T347837) [14:39:10] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004'] [14:39:16] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003'] [14:39:19] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1003'] [14:39:29] !log jclark@cumin1001 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003'] [14:39:32] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1003'] [14:40:07] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003'] [14:42:05] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43855/console" [puppet] - 10https://gerrit.wikimedia.org/r/963081 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [14:42:31] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003'] [14:42:33] !log jclark@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['an-master1003'] [14:42:53] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004'] [14:43:16] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004'] [14:43:41] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003'] [14:43:49] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1003'] [14:43:56] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004'] [14:44:03] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004'] [14:44:52] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004'] [14:45:04] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004'] [14:45:44] !log 
jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004'] [14:45:49] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004'] [14:45:52] (03CR) 10Ayounsi: [C: 03+2] Release v0.6.4 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/963079 (owner: 10Ayounsi) [14:46:02] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004'] [14:46:08] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004'] [14:46:33] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004'] [14:46:36] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004'] [14:46:52] (03PS2) 10Ottomata: mw-page-content-change-enrich - set replicas: 2 for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/963080 (https://phabricator.wikimedia.org/T347676) [14:46:57] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-master1003'] [14:47:24] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004'] [14:47:38] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004'] [14:47:50] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004'] [14:48:32] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.4 - ayounsi@cumin1001 [14:48:47] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:39] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004'] [14:49:41] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-master1004'] [14:49:51] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1004'] [14:50:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.4 - ayounsi@cumin1001 [14:53:13] (03CR) 10Anzx: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [14:53:40] (03CR) 10CI reject: [V: 04-1] fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [14:55:22] (03PS3) 10Anzx: fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) [14:55:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-master1004'] [14:56:21] (03CR) 10CI reject: [V: 04-1] fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [14:56:58] (03PS4) 10Anzx: fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) 
[14:58:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:00:25] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10Aklapper) >>! In T348001#9220485, @RhinosF1 wrote: > @SCampos-WMF: Can you please link your wikitech account to your phabricator account? That would welcome hints how to do that, especially if you... [15:03:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:05:26] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1004.eqiad.wmnet with reason: Phabricator deploys [15:05:40] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: Phabricator deploys [15:05:59] !log brennen@deploy2002 Started deploy [phabricator/deployment@6f19600]: test deploy to phab2002 for T348007 [15:06:07] T348007: Deploy Phabricator/Phorge 2023-10-03 - https://phabricator.wikimedia.org/T348007 [15:06:26] (03CR) 10Elukey: [C: 03+1] "elukey@stat1004:~$ curl "https://recommendation-api-ng.discovery.wmnet:31443/api/spec" -i --http1.1 --resolve recommendation-api-ng.discov" [dns] - 10https://gerrit.wikimedia.org/r/963059 (owner: 10Klausman) [15:06:31] !log brennen@deploy2002 Finished deploy [phabricator/deployment@6f19600]: test deploy to phab2002 for T348007 (duration: 00m 32s) [15:06:53] !log brennen@deploy2002 Started deploy [phabricator/deployment@6f19600]: deploy to phab1004 for T348007 [15:07:01] (03CR) 
10Klausman: [C: 03+2] mwnet: Add CNAMES for recommendation-api-ng running on ml-k8s [dns] - 10https://gerrit.wikimedia.org/r/963059 (owner: 10Klausman) [15:07:05] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Improve Homer output when Juniper device rejects config - https://phabricator.wikimedia.org/T328747 (10ayounsi) 05Open→03Resolved Homer 0.6.4 released. [15:07:37] !log brennen@deploy2002 Finished deploy [phabricator/deployment@6f19600]: deploy to phab1004 for T348007 (duration: 00m 44s) [15:07:59] (03Abandoned) 10Klausman: hiera/services: add service for recommendation-api-ng [puppet] - 10https://gerrit.wikimedia.org/r/963013 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [15:08:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:09:39] (03PS2) 10Jdrewniak: Promote several Wikipedias to Vector 2022 as default skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962684 (https://phabricator.wikimedia.org/T347321) (owner: 10Jdlrobson) [15:10:08] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1033.eqiad.wmnet [15:10:09] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1033.eqiad.wmnet [15:10:43] 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [15:10:45] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1026.eqiad.wmnet [15:11:18] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1026.eqiad.wmnet [15:12:32] (03PS1) 10DLynch: Enable Edit Check on initial partner wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/963084 (https://phabricator.wikimedia.org/T347908) [15:13:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:13:58] (03CR) 10Ryan Kemper: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [15:15:48] (03PS1) 10Hnowlan: helmfile: add edit-analytics and editor-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/963086 (https://phabricator.wikimedia.org/T336415) [15:17:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) cp1112 - D 2. U 1. CableID 20220171 port 21 cp1113 - D 4. U 29 CableID 230304500241 port 6 cp1114 - D 4. U 38 CableID 230304500243 port 8 cp1115 - D 7. U 20 CableID 2303045... 
[15:17:26] (03PS3) 10Jclark-ctr: add an-master100(3,4) to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/962621 (https://phabricator.wikimedia.org/T342291) [15:17:28] (03PS1) 10Jclark-ctr: add an-masters1003,4 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963087 (https://phabricator.wikimedia.org/T342291) [15:18:13] (03CR) 10TChin: [C: 03+1] mw-page-content-change-enrich - set replicas: 2 for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/963080 (https://phabricator.wikimedia.org/T347676) (owner: 10Ottomata) [15:19:53] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye [15:20:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed with error... 
[15:20:49] (03CR) 10Ottomata: [C: 03+2] mw-page-content-change-enrich - set replicas: 2 for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/963080 (https://phabricator.wikimedia.org/T347676) (owner: 10Ottomata) [15:21:49] (03Merged) 10jenkins-bot: mw-page-content-change-enrich - set replicas: 2 for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/963080 (https://phabricator.wikimedia.org/T347676) (owner: 10Ottomata) [15:22:24] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [15:22:53] (03CR) 10Jclark-ctr: [C: 03+2] add an-masters1003,4 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963087 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr) [15:23:06] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF) [15:23:22] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1026.eqiad.wmnet [15:23:24] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1026.eqiad.wmnet [15:23:27] 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [15:23:33] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [15:24:06] (03PS1) 10Cwhite: logstash: increase timeout for curator delete actions [puppet] - 10https://gerrit.wikimedia.org/r/962236 (https://phabricator.wikimedia.org/T347976) [15:24:08] !log mw-page-content-change-enrich - backfill is done, set replicas to 2 in eqiad and codfw [15:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:33] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1026.eqiad.wmnet with OS bullseye [15:24:42] 10SRE, 
10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1026.eqiad.wmnet with OS bullseye [15:24:58] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF) @Ladsgroup: I completely agree! Thank you for letting us know about the standardization and it makes total sense to be similar to the glam-us one. I already talk... [15:26:41] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [15:26:48] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [15:27:26] (03CR) 10Papaul: [C: 03+2] add an-masters1003,4 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963087 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr) [15:27:30] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:46] (03CR) 10Filippo Giunchedi: [C: 03+1] thanos-rule: add pyrra filesystem operator output dir to search path [puppet] - 10https://gerrit.wikimedia.org/r/930628 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:32:55] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:33:06] (03CR) 10Herron: [C: 03+2] thanos-rule: add pyrra filesystem operator output dir to search path [puppet] - 10https://gerrit.wikimedia.org/r/930628 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:34:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:35:05] (03PS2) 10Papaul: add an-masters1003,4 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963087 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr) [15:35:10] 10SRE, 
10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10RobH) [15:36:27] (03CR) 10Papaul: [C: 03+2] add an-masters1003,4 to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/963087 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr) [15:37:27] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1026.eqiad.wmnet with reason: host reimage [15:37:39] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye [15:37:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed with error... [15:39:00] (03CR) 10Clément Goubert: [C: 03+1] helmfile: add edit-analytics and editor-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/963086 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan) [15:40:39] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1026.eqiad.wmnet with reason: host reimage [15:42:05] (03CR) 10Jclark-ctr: [C: 03+2] add an-master100(3,4) to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/962621 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr) [15:44:30] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:45:47] RECOVERY - Check systemd state on wdqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:53] (03CR) 
10Hnowlan: [C: 03+2] helmfile: add edit-analytics and editor-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/963086 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan) [15:46:39] (03PS3) 10Jdrewniak: Promote several Wikipedias to Vector 2022 as default skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962684 (https://phabricator.wikimedia.org/T347321) (owner: 10Jdlrobson) [15:46:58] (03Merged) 10jenkins-bot: helmfile: add edit-analytics and editor-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/963086 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan) [15:47:49] 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10Jhancock.wm) [15:49:30] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:49:42] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye [15:49:44] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye [15:49:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [15:49:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [15:51:52] !log hnowlan@deploy1002 helmfile [staging] START 
helmfile.d/services/edit-analytics: apply [15:57:00] (03CR) 10Jon Harald Søby: [C: 04-1] "The wordmark file ("Wikipedya") uses the incorrect W, and has some kerning issues at the Y/A border. Please make a version of it that is b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [15:57:16] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [15:57:22] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [15:57:34] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [15:59:08] (03PS1) 10DLynch: Edit check: Simplify "experience" config to "maximumEditcount" [extensions/VisualEditor] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963041 [15:59:21] (03PS1) 10DLynch: Edit check: Simplify "experience" config to "maximumEditcount" [extensions/VisualEditor] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963042 [16:00:03] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:00:04] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1600). [16:00:05] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:21] (03CR) 10Jon Harald Søby: fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [16:00:59] o/ [16:01:11] dancy: looking now [16:01:43] It's just https://gerrit.wikimedia.org/r/c/operations/puppet/+/961893 . The other one got merged last week. [16:01:53] (03CR) 10Jbond: [C: 03+2] logspam-watch: Add refreshing indicator [puppet] - 10https://gerrit.wikimedia.org/r/961893 (owner: 10Ahmon Dancy) [16:01:53] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [16:02:09] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [16:03:10] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [16:03:11] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [16:03:29] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10Ladsgroup) 05Open→03Resolved Done. Just note that I created it as a public mailing list but if you want it private, you can change the settings in https://lists.wikimedia.org... [16:03:44] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [16:04:02] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [16:04:47] dancy: merged and deployed to mwlog [16:05:05] (KubernetesAPILatency) resolved: (11) High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:05:40] Thanks! It's working properly. 
[16:06:04] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [16:06:04] great :) [16:06:22] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [16:07:16] 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [16:07:24] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1026.eqiad.wmnet with OS bullseye [16:08:51] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [16:09:00] (03PS2) 10Ryan Kemper: airflow-wmde: Add wmde service user to the Yarn production queue [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [16:09:19] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [16:10:00] (03CR) 10Ryan Kemper: "Pushed a patch that attempts to fix this CI error from https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/7200" [puppet] - 10https://gerrit.wikimedia.org/r/949019 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [16:11:56] (03CR) 10Ottomata: "Related: https://phabricator.wikimedia.org/T336901" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) (owner: 10Ebernhardson) [16:16:25] (03CR) 10Ahmon Dancy: "The changes look OK to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/953200 (owner: 10Clément Goubert) [16:19:43] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [16:19:59] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [16:20:00] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [16:20:24] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [16:23:57] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1027.eqiad.wmnet [16:24:26] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1027.eqiad.wmnet [16:27:01] (03CR) 10Anzx: fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [16:27:27] 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [16:30:42] 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10serviceops-radar, 10Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088 (10Ottomata) [16:33:51] (03CR) 10Andrew Bogott: [C: 03+1] "go go go go" [puppet] - 10https://gerrit.wikimedia.org/r/963029 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [16:36:05] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1027.eqiad.wmnet [16:36:07] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1027.eqiad.wmnet [16:37:18] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1027.eqiad.wmnet with OS bullseye [16:37:29] 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers 
to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1027.eqiad.wmnet with OS bullseye [16:38:30] (03CR) 10Jon Harald Søby: fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [16:39:12] (03PS5) 10Anzx: fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) [16:41:00] (03PS2) 10Anzx: fonwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939) [16:43:40] (03PS4) 10Majavah: dumps: replace WMCS paging Icinga check with Blackbox probe [puppet] - 10https://gerrit.wikimedia.org/r/961897 [16:44:59] (03PS1) 10Hnowlan: edit-analytics, editor-analytics: correct networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/963104 (https://phabricator.wikimedia.org/T336415) [16:45:58] (03PS1) 10DCausse: rdf-streaming-updater: simplify parallelism and use newer kafka APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/963105 (https://phabricator.wikimedia.org/T326914) [16:47:16] (03CR) 10Jon Harald Søby: [C: 03+1] fonwiki: add wgSiteName, wgMetaNamespace, add project namespace, timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963036 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [16:49:14] (03CR) 10Majavah: [C: 03+2] dumps: replace WMCS paging Icinga check with Blackbox probe [puppet] - 10https://gerrit.wikimedia.org/r/961897 (owner: 10Majavah) [16:50:04] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1027.eqiad.wmnet with reason: host reimage [16:51:21] (03CR) 10Anzx: fonwiki: add logos (032 comments) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [16:51:43] (03CR) 10Jon Harald Søby: [C: 03+1] fonwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963066 (https://phabricator.wikimedia.org/T347939) (owner: 10Anzx) [16:52:34] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1027.eqiad.wmnet with reason: host reimage [16:54:08] (03PS1) 10Hnowlan: wmnet: add records for edit-, editor- and page-analytics [dns] - 10https://gerrit.wikimedia.org/r/963106 (https://phabricator.wikimedia.org/T336391) [16:54:22] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:54:32] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10SCampos-WMF) Thank you for sharing this, it was very useful :D ! @RhinosF1 I was able to link my wikitech account to my phabricator account! [16:56:37] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked kubernetes2054 hosts in codfw - jhancock@cumin2002" [16:57:15] (03CR) 10Herron: [C: 03+1] logstash: increase timeout for curator delete actions [puppet] - 10https://gerrit.wikimedia.org/r/962236 (https://phabricator.wikimedia.org/T347976) (owner: 10Cwhite) [16:57:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked kubernetes2054 hosts in codfw - jhancock@cumin2002" [16:57:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:58:38] (03CR) 10Bking: [C: 03+1] rdf-streaming-updater: simplify parallelism and use newer kafka APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/963105 (https://phabricator.wikimedia.org/T326914) (owner: 10DCausse) [16:58:42] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: simplify 
parallelism and use newer kafka APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/963105 (https://phabricator.wikimedia.org/T326914) (owner: 10DCausse) [16:59:48] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [16:59:57] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [17:00:05] (03CR) 10Hnowlan: [C: 03+2] edit-analytics, editor-analytics: correct networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/963104 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1700) [17:00:55] (03Merged) 10jenkins-bot: edit-analytics, editor-analytics: correct networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/963104 (https://phabricator.wikimedia.org/T336415) (owner: 10Hnowlan) [17:02:20] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:04:16] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye [17:04:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[17:05:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:26] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked kubernetes2054 hosts in codfw - jhancock@cumin2002" [17:09:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked kubernetes2054 hosts in codfw - jhancock@cumin2002" [17:09:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:09:20] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:09:57] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye [17:10:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... [17:10:25] (03CR) 10Hnowlan: [C: 03+1] "Oops, my bad." 
[puppet] - 10https://gerrit.wikimedia.org/r/962048 (https://phabricator.wikimedia.org/T331713) (owner: 10Eevans) [17:11:50] RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:13:06] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10RhinosF1) That should be good for the SRE for the clinic this week to handle then :) [17:15:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:48] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1027.eqiad.wmnet with OS bullseye [17:17:59] 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1027.eqiad.wmnet with OS bullseye completed: - restbase1027 (... [17:21:31] 10SRE, 10Cassandra, 10Data-Persistence, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [17:24:16] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963110 (https://phabricator.wikimedia.org/T347080) [17:24:18] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963110 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot) [17:25:00] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sara Campos - https://phabricator.wikimedia.org/T348001 (10SCampos-WMF) Great, thank you for the guidance! 
[17:27:13] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963110 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot) [17:27:36] !log jhuneidi@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.29 refs T347080 [17:27:40] T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080 [17:28:25] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1027.eqiad.wmnet [17:28:25] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1027.eqiad.wmnet [17:33:23] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye [17:33:28] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye [17:33:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [17:33:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [17:33:55] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye [17:34:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... 
[17:34:34] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye [17:34:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [17:34:58] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye [17:35:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... [17:37:34] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [17:38:01] (03CR) 10Eevans: [C: 03+2] install_server: utilize reuse recipe for restbase2027 [puppet] - 10https://gerrit.wikimedia.org/r/962048 (https://phabricator.wikimedia.org/T331713) (owner: 10Eevans) [17:49:32] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [17:58:00] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - 
https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF) Thank you so much, @Ladsgroup! We really appreciate this and have started to share it with folks already. 🙌 [17:58:09] 10SRE, 10Infrastructure-Foundations: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036 (10Eevans) [17:59:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:00:05] jeena and dduvall: (Dis)respected human, time to deploy MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1800). Please do the needful. [18:04:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:11:00] !log jhuneidi@deploy2002 Finished scap: testwikis wikis to 1.41.0-wmf.29 refs T347080 (duration: 43m 24s) [18:11:04] T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080 [18:12:11] (03Abandoned) 10Varnent: [foundationwiki] Grant translation admin rights to 'sysop' and 'global-sysop' groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959354 (https://phabricator.wikimedia.org/T346187) (owner: 10Varnent) [18:13:17] !log jhuneidi@deploy2002 Pruned MediaWiki: 1.41.0-wmf.27 (duration: 02m 14s) [18:15:01] (03PS1) 10Krinkle: Remove wgResourceLoaderStorageVersion override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963124 (https://phabricator.wikimedia.org/T343407) [18:16:51] jouncebot: now [18:16:51] For the next 1 hour(s) and 43 minute(s): MediaWiki train - Utc-7 Version 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T1800) [18:16:54] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) [18:17:07] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) [18:17:13] 10SRE, 10Traffic, 10Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ssingh) [18:17:31] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) p:05Triage→03Medium [18:17:49] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963125 (https://phabricator.wikimedia.org/T347080) [18:17:53] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963125 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot) [18:18:28] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963125 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot) [18:20:27] (03Abandoned) 10Ebernhardson: k8s config: Provide kafka and zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 (owner: 10Ebernhardson) [18:21:35] (03Abandoned) 10Ebernhardson: flink-app: Provide kafka hosts as properties file [deployment-charts] - 10https://gerrit.wikimedia.org/r/959066 (owner: 10Ebernhardson) [18:21:46] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF) Hi @Ladsgroup. 
I'm sorry for reopening the ticket again but someone just flagged to me that "EU" can be problematic because it could mean only countries within t... [18:21:59] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF) 05Resolved→03Open [18:23:53] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10Ladsgroup) Rename is not that easily possible. I can delete the mailing list and create it again and mass subscribe previous members. That means all settings changes will be gone... [18:25:02] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF) That's fine and no problem at all from our side, @Ladsgroup! That would help us a lot actually. Thank you so much! [18:25:16] (03PS6) 10Ebernhardson: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) [18:25:20] 10SRE, 10ops-codfw, 10decommission-hardware: decommission frauth2001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340153 (10Papaul) 05Open→03Resolved ` papaul@fasw-c-codfw# show |compare [edit interfaces interface-range disabled] member "ge-[0-1]/0/16" { ... } + member "ge-[0-1]/0/17";... 
[18:25:22] (03CR) 10Ebernhardson: "To keep things moving I've narrowed down the scope of this patch, removing the functionality to source zookeeper host/port based on a clus" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson) [18:25:46] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.29 refs T347080 [18:25:50] T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080 [18:26:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:27:02] (03CR) 10Jdrewniak: "This change is ready for review." [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [18:30:55] (03PS1) 10Ebernhardson: flink-app chart: Add zookeeper to egress_enabled fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/963130 [18:31:34] (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:31:34] (03PS7) 10Ebernhardson: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) [18:31:36] (03CR) 10Ebernhardson: Pull some flink config down into the chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson) [18:31:56] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) We can and probably should have a backup static routes for 
each of `ns[01]` but it can be to a single host instead of al... [18:37:49] (03PS8) 10Ebernhardson: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) [18:38:05] (03CR) 10Ebernhardson: "With the scope reduced, i think the main question remaining here is if these opinionated paths are the ones we want to use going forward. " [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T336901) (owner: 10Ebernhardson) [18:48:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye [18:48:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... 
[18:48:47] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:52:11] (03CR) 10Bking: [C: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [18:52:23] (03CR) 10Bking: [C: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [18:52:38] (03PS9) 10Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [18:53:21] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [18:55:35] (03CR) 10Bking: [C: 03+2] elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [19:02:26] (03PS1) 10Ryan Kemper: elastic: remove rack info for old hosts [puppet] - 10https://gerrit.wikimedia.org/r/963134 (https://phabricator.wikimedia.org/T316728) [19:02:37] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10Ladsgroup) 05Open→03Resolved {{done}} [19:05:46] (03PS2) 10Ryan Kemper: elastic: remove rack info for old hosts [puppet] - 10https://gerrit.wikimedia.org/r/963134 (https://phabricator.wikimedia.org/T316728) [19:09:02] (03CR) 10Bking: [C: 03+1] elastic: remove rack info for old hosts [puppet] - 10https://gerrit.wikimedia.org/r/963134 (https://phabricator.wikimedia.org/T316728) (owner: 10Ryan Kemper) [19:09:08] (03CR) 10Ryan Kemper: [C: 03+2] elastic: remove rack info for old hosts [puppet] - 
10https://gerrit.wikimedia.org/r/963134 (https://phabricator.wikimedia.org/T316728) (owner: 10Ryan Kemper) [19:10:44] Hi. jelto do you have a few minutes? [19:15:25] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye [19:15:27] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye [19:15:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [19:15:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [19:15:51] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye [19:15:53] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye [19:15:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... [19:15:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... 
[19:16:31] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye [19:16:33] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye [19:16:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [19:16:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [19:23:11] (03CR) 10C. Scott Ananian: [C: 03+1] Re-enable Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [19:31:18] (03CR) 10Ryan Kemper: "Despite the CirrusSearch patch referenced in my last comment, we're not seeing any metrics for MediaWiki.CirrusSearch.eqiad.backend_failur" [puppet] - 10https://gerrit.wikimedia.org/r/830240 (https://phabricator.wikimedia.org/T316712) (owner: 10Ebernhardson) [19:38:02] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye [19:38:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... 
[19:38:50] (03PS3) 10Jdrewniak: Web typography prototype survey [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) [19:38:58] (03PS1) 10Jdrewniak: Correct a recently-added message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 [19:39:06] (03PS1) 10Jdrewniak: [Prototype] Change i18n message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) [19:41:47] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye [19:41:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [19:52:31] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10BBlack) Looks about right to me! [19:57:15] (03CR) 10BBlack: [C: 03+1] purged: use unix socket for varnish in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/963081 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T2000). [20:00:05] jdrewniak and sbailey: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:29] I am here with cscott [20:00:55] * jan_drewniak o/ [20:04:01] * jan_drewniak sbailey: if the regular deployers don't show up, I can do the deploy [20:04:32] ok [20:05:47] sbailey: I can do yours first since it's just a config change [20:06:33] ok, :-) [20:08:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [20:09:19] (03PS8) 10Jdrewniak: Re-enable Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [20:09:35] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [20:10:04] hello, all. [20:11:16] (03Merged) 10jenkins-bot: Re-enable Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [20:11:40] https://www.mediawiki.org/wiki/Help:Extension:ParserMigration shows (inter alia) how to test that this is working correctly on labs [20:11:49] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:944978|Re-enable Extension:ParserMigration on labs (T333179)]] [20:12:03] T333179: (Re)deploy ParserMigration extension to production - https://phabricator.wikimedia.org/T333179 [20:16:04] (03CR) 10Fabfur: [V: 03+1 C: 03+2] purged: use unix socket for varnish in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/963081 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [20:16:26] https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version should eventually show ParserMigration as well [20:16:57] !log merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/963081 (T347837).
`purged` daemon will be restarted by puppet in eqsin in the next 30m [20:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:01] T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 [20:17:30] 10SRE, 10Traffic, 10Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [20:18:28] https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version#mw-version-ext-specialpage-ParserMigration [20:20:11] (03PS1) 10Fabfur: purged: use unix socket for varnish in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/963147 [20:23:30] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43857/console" [puppet] - 10https://gerrit.wikimedia.org/r/963147 (owner: 10Fabfur) [20:23:46] jan_drewniak: has the ParserMigration config been synced or are we still waiting for it? [20:24:52] 10SRE, 10Wikimedia-Mailing-lists: Mailing list for the GLAM-Wiki activities in Europe - https://phabricator.wikimedia.org/T347917 (10GFontenelle_WMF) @Ladsgroup: Thank you so much! [20:25:48] cscott: still waiting... [20:29:52] Tested looking good, thanks [20:34:06] !log jdrewniak@deploy2002 jdrewniak and sbailey: Backport for [[gerrit:944978|Re-enable Extension:ParserMigration on labs (T333179)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:34:14] T333179: (Re)deploy ParserMigration extension to production - https://phabricator.wikimedia.org/T333179 [20:34:39] * jan_drewniak sbailey, cscott: ok finally, it's on mwdebug [20:35:01] sbailey, cscott: ok finally, it's on mwdebug [20:35:01] We both tested it, looking good :-).
Thanks Jan [20:35:06] !log jdrewniak@deploy2002 jdrewniak and sbailey: Continuing with sync [20:43:26] (03CR) 10Cwhite: [C: 03+2] logstash: increase timeout for curator delete actions [puppet] - 10https://gerrit.wikimedia.org/r/962236 (https://phabricator.wikimedia.org/T347976) (owner: 10Cwhite) [20:44:34] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (GET blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:46:32] (03CR) 10Jdrewniak: [C: 03+2] Promote several Wikipedias to Vector 2022 as default skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962684 (https://phabricator.wikimedia.org/T347321) (owner: 10Jdlrobson) [20:47:50] (03Merged) 10jenkins-bot: Promote several Wikipedias to Vector 2022 as default skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962684 (https://phabricator.wikimedia.org/T347321) (owner: 10Jdlrobson) [20:49:34] (KubernetesAPILatency) resolved: (9) High Kubernetes API latency (GET blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:49:40] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye [20:49:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... 
[20:50:42] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:944978|Re-enable Extension:ParserMigration on labs (T333179)]] (duration: 38m 52s) [20:50:45] (03CR) 10Jdrewniak: [C: 03+2] Web typography prototype survey [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [20:50:45] T333179: (Re)deploy ParserMigration extension to production - https://phabricator.wikimedia.org/T333179 [20:50:53] (03CR) 10Jdrewniak: [C: 03+2] Correct a recently-added message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak) [20:50:58] (03CR) 10Jdrewniak: [C: 03+2] [Prototype] Change i18n message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [20:51:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [20:51:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak) [20:52:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [20:56:34] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye [20:56:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS 
bullseye executed w... [21:03:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [21:03:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak) [21:03:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [21:07:02] (03CR) 10CI reject: [V: 04-1] Web typography prototype survey [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [21:07:14] (03CR) 10CI reject: [V: 04-1] Correct a recently-added message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak) [21:07:30] (03CR) 10CI reject: [V: 04-1] [Prototype] Change i18n message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [21:08:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [21:08:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak) [21:08:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) 
[21:09:09] (03CR) 10Jdrewniak: [C: 03+2] "recheck" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [21:13:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [21:13:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak) [21:13:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [21:23:14] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:962684|Promote several Wikipedias to Vector 2022 as default skin (T347321)]] [21:23:18] T347321: Deploy Vector 2022 as the default on next set of wikis - https://phabricator.wikimedia.org/T347321 [21:24:37] !log jdrewniak@deploy2002 jdlrobson and jdrewniak: Backport for [[gerrit:962684|Promote several Wikipedias to Vector 2022 as default skin (T347321)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:25:21] (03PS1) 10EoghanGaffney: [gitlab/switchover] Change profile::gitlab::service_name for switchover [puppet] - 10https://gerrit.wikimedia.org/r/963160 (https://phabricator.wikimedia.org/T345531) [21:26:20] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:26:23] !log jdrewniak@deploy2002 jdlrobson and jdrewniak: Continuing with sync [21:27:11] (03Merged) 10jenkins-bot: Web typography prototype survey [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963043 
(https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [21:27:19] (03Merged) 10jenkins-bot: Correct a recently-added message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963137 (owner: 10Jdrewniak) [21:28:52] (03CR) 10Jdrewniak: [C: 03+2] "recheck" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [21:29:04] (03PS1) 10EoghanGaffney: [gitlab/switchover] Update DNS for gitlab/gitlab-replica [dns] - 10https://gerrit.wikimedia.org/r/963161 (https://phabricator.wikimedia.org/T345531) [21:32:40] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:962684|Promote several Wikipedias to Vector 2022 as default skin (T347321)]] (duration: 09m 26s) [21:32:45] T347321: Deploy Vector 2022 as the default on next set of wikis - https://phabricator.wikimedia.org/T347321 [21:33:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [21:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:38:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:43:05] (03Merged) 10jenkins-bot: [Prototype] Change i18n message [skins/Vector] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/963138 (https://phabricator.wikimedia.org/T347208) (owner: 10Jdrewniak) [21:43:36] !log jdrewniak@deploy2002 Started scap: 
Backport for [[gerrit:963043|Web typography prototype survey (T347208)]], [[gerrit:963137|Correct a recently-added message]], [[gerrit:963138|[Prototype] Change i18n message (T347208)]] [21:43:40] T347208: Launch Community Prototype - https://phabricator.wikimedia.org/T347208 [21:49:18] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netbox, 10netops: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (10wiki_willy) ++ @Papaul , who's going to dig around a bit and provide some feedback [22:01:56] !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:963043|Web typography prototype survey (T347208)]], [[gerrit:963137|Correct a recently-added message]], [[gerrit:963138|[Prototype] Change i18n message (T347208)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:02:09] T347208: Launch Community Prototype - https://phabricator.wikimedia.org/T347208 [22:11:09] !log jdrewniak@deploy2002 jdrewniak: Continuing with sync [22:16:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:21:34] (KubernetesAPILatency) resolved: (13) High Kubernetes API latency (GET blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:22:44] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:963043|Web typography prototype survey (T347208)]], [[gerrit:963137|Correct a recently-added message]], [[gerrit:963138|[Prototype] Change i18n message (T347208)]] (duration: 39m 08s) [22:22:48] T347208: Launch Community Prototype - https://phabricator.wikimedia.org/T347208 [22:23:24] PROBLEM - Disk space on testreduce1001 is CRITICAL: DISK CRITICAL - free space: /srv/data 0 MB (0% inode=99%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=testreduce1001&var-datasource=eqiad+prometheus/ops [22:25:40] (03PS1) 10Volans: setup.py: upper limit for types-requests [cookbooks] - 10https://gerrit.wikimedia.org/r/963188 [22:25:42] (03PS1) 10Volans: sre.hosts.reimage: fix confctl repool command [cookbooks] - 10https://gerrit.wikimedia.org/r/963189 (https://phabricator.wikimedia.org/T347954) [22:32:33] (03CR) 10Volans: [C: 03+2] "Self-merging to unblock CI on the other CRs. Happy to adapt if there is any post-merge comment." [cookbooks] - 10https://gerrit.wikimedia.org/r/963188 (owner: 10Volans) [22:35:06] (03Merged) 10jenkins-bot: setup.py: upper limit for types-requests [cookbooks] - 10https://gerrit.wikimedia.org/r/963188 (owner: 10Volans) [22:36:51] (03PS15) 10Volans: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [22:36:58] (03PS2) 10Volans: [gitlab/failover] Increase alert downtime duration [cookbooks] - 10https://gerrit.wikimedia.org/r/962636 (owner: 10EoghanGaffney) [22:48:47] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:58:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T343198)', diff saved to https://phabricator.wikimedia.org/P52808 and previous config saved to /var/cache/conftool/dbconfig/20231003-225803-arnaudb.json [22:58:07] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [23:12:54] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.027e+05 gt 1e+05 
https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [23:13:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P52809 and previous config saved to /var/cache/conftool/dbconfig/20231003-231309-arnaudb.json [23:28:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P52810 and previous config saved to /var/cache/conftool/dbconfig/20231003-232815-arnaudb.json [23:43:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T343198)', diff saved to https://phabricator.wikimedia.org/P52811 and previous config saved to /var/cache/conftool/dbconfig/20231003-234322-arnaudb.json [23:43:24] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [23:43:26] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [23:43:37] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [23:43:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T343198)', diff saved to https://phabricator.wikimedia.org/P52812 and previous config saved to /var/cache/conftool/dbconfig/20231003-234343-arnaudb.json [23:49:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:50:45] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 
2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Htriedman) Hi all! I've made updates to the codebase to better comply with @Eevans' feedback, resulting in a greatly simplified int... [23:54:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency