[00:01:12] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[00:01:23] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1112.eqiad.wmnet with OS bullseye
[00:01:28] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye completed: - cp1112 (**PASS**) - Remo...
[00:17:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:10] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[00:31:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[00:39:01] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/969994
[00:39:03] (CR) TrainBranchBot: [C:
+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/969994 (owner: TrainBranchBot)
[00:42:24] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:07] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/969994 (owner: TrainBranchBot)
[01:01:18] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:11:04] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:36:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:41:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[01:50:04] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:51:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[01:54:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:57:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:12] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:25:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[02:26:27] SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew)
[02:35:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage -
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[02:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:42:28] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[02:47:28] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[02:50:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:53:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater -
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[02:54:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:58:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[03:04:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:04:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:11:04] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:11:10] SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew) In order to see if things are degrading, here's a point in time slice: P53122
[03:15:10] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[03:31:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[03:35:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.0213230928091366s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:40:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 3.4426888837674494s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:57:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:12] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:27] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[04:14:10] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.004e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[04:40:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:41:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:41:48] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:44:30] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:46:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50862 bytes in 1.558 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:46:42] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:58:18] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - No response from remote host 185.15.58.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:08:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[05:13:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[05:26:47] * kart_ updating MinT
[05:28:14] (CR) KartikMistry: [C: +2] Update MinT to 2023-10-31-044726-production [deployment-charts] - https://gerrit.wikimedia.org/r/968388 (https://phabricator.wikimedia.org/T333969) (owner: KartikMistry)
[05:28:59] (Merged) jenkins-bot: Update MinT to 2023-10-31-044726-production [deployment-charts] - https://gerrit.wikimedia.org/r/968388 (https://phabricator.wikimedia.org/T333969) (owner: KartikMistry)
[05:29:46] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[05:32:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:32:39] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[05:36:10] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:40:51] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[05:46:17] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[05:51:47] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[05:52:08] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[05:57:05] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[05:57:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T0600)
[06:00:55] !log Updated MinT to 2023-10-31-044726-production (T333969, T349991, T349079, T340507)
[06:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:03] T349991: MinT: Exception on /api/translate/nn/ff [POST] - https://phabricator.wikimedia.org/T349991
[06:01:03] T340507: Create a language detection
service in LiftWing - https://phabricator.wikimedia.org/T340507
[06:01:04] T333969: Enable Opus models for languages lacking other Machine Translation options - https://phabricator.wikimedia.org/T333969
[06:01:04] T349079: Test instance for MinT keeps loading forever in some translations - https://phabricator.wikimedia.org/T349079
[06:02:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[06:07:43] (CR) Andrea Denisse: prometheus: Add a default rsyslog destination for all sites (2 comments) [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[06:12:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:14:06] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:21:10] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:21:12] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:48] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded:
0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:48] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:10] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:12] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:34:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[06:39:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[06:42:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory -
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:46:16] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:56:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:16] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:03:04] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:07:16] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:10:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:16] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:21:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:38:04] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:42:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:44:49] (PS1) Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner package to 16.3 [puppet] - https://gerrit.wikimedia.org/r/970652 (https://phabricator.wikimedia.org/T350215)
[07:44:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[07:49:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[07:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:53:49] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/970652 (https://phabricator.wikimedia.org/T350215) (owner: Jelto)
[07:56:52] (CR) Muehlenhoff: [C: +2] package_builder: Avoid Ferm-specific syntax [puppet] -
https://gerrit.wikimedia.org/r/969338 (owner: Muehlenhoff)
[07:57:47] (CR) Jelto: [C: +2] aptrepo: upgrade gitlab-ce and gitlab-runner package to 16.3 [puppet] - https://gerrit.wikimedia.org/r/970652 (https://phabricator.wikimedia.org/T350215) (owner: Jelto)
[07:58:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[08:00:05] Amir1, Urbanecm, and taavi: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T0800)
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:01:27] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[08:03:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[08:05:35] (PS1) Muehlenhoff: Switch builder role to nftables [puppet] - https://gerrit.wikimedia.org/r/970655
[08:07:04] ops-eqiad: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (Frostly)
[08:12:43] (PS1) Muehlenhoff: sretest: Enable nftables on the role level [puppet] - https://gerrit.wikimedia.org/r/970716
[08:13:38] (CR) Muehlenhoff: [C: +2] pybaltest: Avoid Ferm-specific syntax [puppet] -
https://gerrit.wikimedia.org/r/969370 (owner: Muehlenhoff)
[08:17:28] (PS1) Muehlenhoff: Switch pybaltest to nftables [puppet] - https://gerrit.wikimedia.org/r/970718
[08:23:59] (PS1) Muehlenhoff: Switch cuminunpriv to nftables [puppet] - https://gerrit.wikimedia.org/r/970720
[08:25:15] (Abandoned) Muehlenhoff: webperf: Remove Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945782 (owner: Muehlenhoff)
[08:30:37] (PS1) Muehlenhoff: netbox: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/970722
[08:34:52] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: GitLab version upgrade
[08:39:27] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/970722 (owner: Muehlenhoff)
[08:49:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[08:54:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[08:55:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:48] (CR) Filippo Giunchedi: [C: +2] team-sre: ignore
systemd_unit_.+_owner stale textfile [alerts] - https://gerrit.wikimedia.org/r/970402 (https://phabricator.wikimedia.org/T349176) (owner: Filippo Giunchedi)
[09:09:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:10:59] (PS5) Majavah: diffscan: add support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/970335
[09:11:01] (PS6) Majavah: P:diffscan: add support for configuring multiple instances [puppet] - https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653)
[09:11:03] (PS2) Majavah: P:diffscan: add scan for WMCS infrastructure addresses [puppet] - https://gerrit.wikimedia.org/r/970396 (https://phabricator.wikimedia.org/T206653)
[09:11:56] !log installing curl security updates
[09:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:01] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/283/con" [puppet] - https://gerrit.wikimedia.org/r/970396 (https://phabricator.wikimedia.org/T206653) (owner: Majavah)
[09:13:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:28:13] !log installing RT security updates
[09:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:20] (PS1) Muehlenhoff: RT: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/970724
[09:34:00] (PS2) Muehlenhoff: RT: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/970724
[09:34:50] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/970724 (owner: Muehlenhoff)
[09:46:11] !log installing ncurses
security updates
[09:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:14] !log installing yajl security updates
[09:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T1000)
[10:02:17] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: GitLab version upgrade
[10:03:47] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: GitLab version upgrade
[10:04:29] SRE, SRE-Access-Requests, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (JMeybohm) >>! In T344199#9229747, @ATsay-WMF wrote: > Hello, I'd like to request access to analytics-privatedata-users as well....
[10:05:56] (03CR) 10JMeybohm: [C: 03+2] admin: add amyt to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/964000 (https://phabricator.wikimedia.org/T344199) (owner: 10Jelto) [10:06:57] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10JMeybohm) 05Open→03Resolved [10:07:17] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10JMeybohm) [10:10:19] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: GitLab version upgrade [10:15:24] (03PS1) 10Majavah: systemd: allow passing source to a unit [puppet] - 10https://gerrit.wikimedia.org/r/970727 [10:15:26] (03PS1) 10Majavah: ldap: client: auto-restart sssd-nss on failure [puppet] - 10https://gerrit.wikimedia.org/r/970728 (https://phabricator.wikimedia.org/T349687) [10:22:52] !log installing adduser security updates [10:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:35] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [10:33:36] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: GitLab version upgrade [10:35:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/970281 (owner: 10Slyngshede) [10:36:21] (03CR) 10Muehlenhoff: modules: cleanup last dispatch renmants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967952 (https://phabricator.wikimedia.org/T344937) (owner: 10Filippo Giunchedi) [10:37:12] (03Abandoned) 10Muehlenhoff: aborrero: remove user [puppet] - 10https://gerrit.wikimedia.org/r/964940 (owner: 10Arturo Borrero 
Gonzalez) [10:40:04] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:44:20] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:04] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:39] (03PS1) 10Muehlenhoff: karapace: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/970732 [10:53:38] (03PS2) 10Muehlenhoff: karapace: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/970732 [10:59:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952459 (owner: 10Muehlenhoff) [11:03:12] (03PS1) 10Gerrit maintenance bot: Add dga to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/969995 (https://phabricator.wikimedia.org/T350218) [11:03:52] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10JMeybohm) Thanks for raising this @Urbanecm_WMF, we will have the docs clarified soon. Standard volunteer NDA access did not regularly require on C level approval in the pas... 
[11:05:45] (03PS1) 10Gerrit maintenance bot: Add zgh to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/970001 (https://phabricator.wikimedia.org/T350216) [11:11:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:18] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:42] (03CR) 10Filippo Giunchedi: modules: cleanup last dispatch renmants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967952 (https://phabricator.wikimedia.org/T344937) (owner: 10Filippo Giunchedi) [11:17:44] (03PS2) 10Filippo Giunchedi: modules: cleanup last dispatch renmants [puppet] - 10https://gerrit.wikimedia.org/r/967952 (https://phabricator.wikimedia.org/T344937) [11:17:49] (03CR) 10Ladsgroup: [C: 03+1] Add LoginNotify cron job [puppet] - 10https://gerrit.wikimedia.org/r/965620 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [11:23:11] (03PS5) 10Stevemunene: druid: Add druid druid10[09-11] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/962250 (https://phabricator.wikimedia.org/T336042) [11:33:29] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see pcc at https://puppet-compiler.wmflabs.org/output/965561/285/" [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [11:38:11] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [11:41:00] (03CR) 10Kamila Součková: [C: 03+1] Update benthos to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/969366 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [11:42:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:12] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:15] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/970379 (https://phabricator.wikimedia.org/T348129) (owner: 10Ayounsi) [11:49:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:35] (03PS1) 10Aklapper: Exclude bot accounts from applying antivandalism thresholds [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/970742 (https://phabricator.wikimedia.org/T350245) [11:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:51:20] (03CR) 10Aklapper: "No time to test locally but pretty confident about this" [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/970742 (https://phabricator.wikimedia.org/T350245) (owner: 10Aklapper) [11:53:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:18] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2011 and lvs2012 to new top-of-rack switches - https://phabricator.wikimedia.org/T348178 (10cmooney) >>! In T348178#9227051, @ayounsi wrote: >> Secondary Link Migration > Looking at link usage, it's fine to drop the seconda... 
[11:54:44] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [11:56:52] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: partial-backup.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/967952 (https://phabricator.wikimedia.org/T344937) (owner: 10Filippo Giunchedi) [12:01:27] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [12:01:32] (03PS1) 10Ladsgroup: Set pagelinks migration in s4 to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970745 (https://phabricator.wikimedia.org/T345732) [12:05:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:59] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: GitLab version upgrade [12:06:24] (ProbeDown) firing: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:07:00] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:08:00] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on 
cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:09:12] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:38] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:24] (ProbeDown) resolved: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:17:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [12:19:51] jouncebot: nowandnext [12:19:51] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [12:19:51] In 0 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T1300) [12:20:07] (03CR) 10Ladsgroup: [C: 03+2] Set pagelinks migration in s4 to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970745 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [12:20:51] (03Merged) 10jenkins-bot: Set pagelinks migration in s4 to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970745 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [12:22:06] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:970745|Set 
pagelinks migration in s4 to write both (T345732)]] [12:22:10] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [12:22:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [12:23:30] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:970745|Set pagelinks migration in s4 to write both (T345732)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:24:06] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [12:25:50] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:31:02] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:19] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:970745|Set pagelinks migration in s4 to write both (T345732)]] (duration: 09m 12s) [12:31:36] (03PS1) 10Cathal Mooney: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf #2 [homer/public] - 10https://gerrit.wikimedia.org/r/970767 (https://phabricator.wikimedia.org/T347030) [12:31:58] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [12:34:28] (03PS1) 10Jelto: sre.gitlab.upgrade: unpause runners during downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/970768 [12:37:05] (ProbeDown) firing: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:41:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [12:43:52] !log cmooney@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt-wdqs1002'] [12:44:05] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirt-wdqs1002'] [12:47:20] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [12:47:48] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [12:52:05] (ProbeDown) resolved: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[12:55:12] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:17] (03CR) 10Filippo Giunchedi: [C: 03+2] modules: cleanup last dispatch renmants [puppet] - 10https://gerrit.wikimedia.org/r/967952 (https://phabricator.wikimedia.org/T344937) (owner: 10Filippo Giunchedi) [12:58:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:54] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Add zgh to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/970001 (https://phabricator.wikimedia.org/T350216) (owner: 10Gerrit maintenance bot) [12:59:13] (03PS2) 10Ladsgroup: Add dga to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/969995 (https://phabricator.wikimedia.org/T350218) (owner: 10Gerrit maintenance bot) [12:59:17] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Add dga to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/969995 (https://phabricator.wikimedia.org/T350218) (owner: 10Gerrit maintenance bot) [13:00:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:16] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:01:18] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:01:27] !log installing glib2.0 security updates [13:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:36] Any deployer around? [13:04:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [13:05:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:13] !log installing libx11 security updates [13:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:16] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:09:58] 
(RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [13:11:21] !log cmooney@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt-wdqs1002'] [13:11:37] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt-wdqs1002'] [13:12:04] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:18] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:28] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [13:19:36] (03CR) 10Majavah: [C: 03+1] "couple of minor things inline, but otherwise looks fine." 
[homer/public] - 10https://gerrit.wikimedia.org/r/970767 (https://phabricator.wikimedia.org/T347030) (owner: 10Cathal Mooney) [13:20:53] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:21:28] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [13:22:16] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:23:05] (03CR) 10Majavah: [C: 04-1] "This is doing too many things at once, I'd be much more comfortable with a couple of smaller patches that can be deployed separately." [puppet] - 10https://gerrit.wikimedia.org/r/970341 (https://phabricator.wikimedia.org/T350132) (owner: 10Cathal Mooney) [13:25:37] !log bking@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [13:25:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [13:29:08] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [13:32:25] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [13:33:21] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [13:34:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:39:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:43:35] (03CR) 10Muehlenhoff: [C: 03+2] Switch arclamp to nftables [puppet] - 10https://gerrit.wikimedia.org/r/969328 (owner: 10Muehlenhoff) [13:47:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:16] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:43] (03PS2) 10Cathal Mooney: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf #2 [homer/public] - 10https://gerrit.wikimedia.org/r/970767 (https://phabricator.wikimedia.org/T347030) [13:51:44] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp1114 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [13:51:46] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1113 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [13:51:52] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp1113 is CRITICAL: connect to address 10.64.53.17 and port 3126: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [13:51:52] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp1114 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused 
https://wikitech.wikimedia.org/wiki/HTTPS [13:51:58] (03CR) 10Ssingh: [C: 03+1] Switch pybaltest to nftables [puppet] - 10https://gerrit.wikimedia.org/r/970718 (owner: 10Muehlenhoff) [13:51:58] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp1113 is CRITICAL: connect to address 10.64.53.17 and port 3127: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [13:52:00] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp1114 is CRITICAL: connect to address 10.64.48.27 and port 3123: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [13:52:03] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [13:52:05] er [13:52:06] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp1114 is CRITICAL: connect to address 10.64.48.27 and port 3122: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [13:52:11]  cp11xx are not in production, don't worry much [13:52:15] downtme expired [13:52:16] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [13:52:17] renewing [13:52:28] PROBLEM - Check systemd state on cp1114 is CRITICAL: CRITICAL - degraded: The following units failed: haproxy_stek_job.service,varnishncsa.service,wmf_auto_restart_varnish-frontend-fetcherr.service,wmf_auto_restart_varnish-frontend-hospital.service,wmf_auto_restart_varnish-frontend-slowlog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:30] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 15 hosts with reason: not pooled, reimaging in progress [13:52:30] PROBLEM - Webrequests Varnishkafka log producer on cp1113 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S 
/etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [13:52:34] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp1114 is CRITICAL: connect to address 10.64.48.27 and port 3121: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [13:52:34] PROBLEM - Webrequests Varnishkafka log producer on cp1114 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [13:52:40] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp1113 is CRITICAL: connect to address 10.64.53.17 and port 3122: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [13:52:40] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp1114 is CRITICAL: connect to address 10.64.48.27 and port 3124: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [13:52:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 15 hosts with reason: not pooled, reimaging in progress [13:52:59] donwtimed again [13:53:10] in the process of being reimaged, no host is pooled [13:55:12] (03PS1) 10Ottomata: eventgate chart - add nodejs_extra_opts value [deployment-charts] - 10https://gerrit.wikimedia.org/r/970784 (https://phabricator.wikimedia.org/T347477) [13:55:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp1100.eqiad.wmnet with reason: not pooled, reimaging in progress [13:55:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp1100.eqiad.wmnet with reason: not pooled, reimaging in progress [14:00:06] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T1400) [14:03:55] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 
(10cmooney) @VRiley-WMF cloudvirt-wdqs1002 is showing a media/cable failure when it tries to boot over network: {F41426317,width=600} That could be that the... [14:04:29] (03CR) 10Cathal Mooney: Deny traffic from cloud pub ranges to WMF private IPs and tidy conf #2 (033 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/970767 (https://phabricator.wikimedia.org/T347030) (owner: 10Cathal Mooney) [14:05:37] (03CR) 10Ottomata: [C: 03+2] eventgate chart - add nodejs_extra_opts value [deployment-charts] - 10https://gerrit.wikimedia.org/r/970784 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [14:06:34] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host vrts1001.eqiad.wmnet [14:06:43] (03Merged) 10jenkins-bot: eventgate chart - add nodejs_extra_opts value [deployment-charts] - 10https://gerrit.wikimedia.org/r/970784 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [14:06:47] (03PS1) 10Muehlenhoff: acme_chief: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/970786 [14:08:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/970786 (owner: 10Muehlenhoff) [14:10:28] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1001.eqiad.wmnet [14:10:59] (03PS1) 10Papaul: Add new cloudclontrol and cloudnet to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/970788 (https://phabricator.wikimedia.org/T342455) [14:11:55] (03CR) 10Papaul: [C: 03+2] Add new cloudclontrol and cloudnet to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/970788 (https://phabricator.wikimedia.org/T342455) (owner: 10Papaul) [14:14:03] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [14:14:16] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [14:14:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team 
(Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Papaul) @Jclark-ctr @VRiley-WMF those are ready now for OS install. Thanks [14:14:54] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10lmata) [14:15:26] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-155326.scope,session-155328.scope,session-155400.scope,session-155401.scope,session-155437.scope,session-155634.scope,session-155642.scope,session-155646.scope,session-155647.scope,session-155665.scope,session-155666.scope,session-c3789.scope,session-c3790.scope,session-c3791.scope,session-c3792.scope,session-c3793.scope,session [14:15:26] cope,session-c3795.scope,session-c3796.scope,session-c3797.scope,session-c3798.scope,session-c3799.scope,session-c3800.scope,session-c3801.scope,session-c734.scope,user@0.service,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:29] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [14:15:42] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [14:16:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:52] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10cmooney) >>! In T346948#9291643, @VRiley-WMF wrote: > cloudvirt-wdqs1003 has been relocated > > cloudvirt-wdqs1003 - C 8. U 21. port 18. CableID 4015 >... 
[14:28:59] (03PS2) 10Muehlenhoff: acme_chief: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/970786 [14:33:00] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:20] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:28] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10VRiley-WMF) @cmooney I have replaced the DAC cable and updated Netbox with the CableID; also I reseated the NIC for good measure. It is plugged into the sa... [14:36:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/970786 (owner: 10Muehlenhoff) [14:38:44] (03PS2) 10Urbanecm: Growth: Enable new Impact module on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969966 (https://phabricator.wikimedia.org/T336203) [14:39:36] jouncebot: nowandnext [14:39:37] For the next 0 hour(s) and 20 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T1400) [14:39:37] In 2 hour(s) and 20 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T1700) [14:39:39] (03CR) 10Urbanecm: [C: 03+2] Growth: Enable new Impact module on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969966 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [14:40:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - 
degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:25] (03Merged) 10jenkins-bot: Growth: Enable new Impact module on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969966 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [14:41:29] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:969966|Growth: Enable new Impact module on all Wikipedias (T336203)]] [14:41:34] T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203 [14:42:04] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-155326.scope,session-155328.scope,session-155400.scope,session-155401.scope,session-155437.scope,session-155634.scope,session-155642.scope,session-155646.scope,session-155647.scope,session-155665.scope,session-155666.scope,session-c3789.scope,session-c3790.scope,session-c3791.scope,session-c3792.scope,session-c3793.scope,session [14:42:04] cope,session-c3795.scope,session-c3796.scope,session-c3797.scope,session-c3798.scope,session-c3799.scope,session-c3800.scope,session-c3801.scope,session-c3802.scope,session-c734.scope,user-runtime-dir@114.service,user@0.service,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:43:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:07] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 
10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) [14:44:14] (03CR) 10Muehlenhoff: [C: 03+2] Switch pybaltest to nftables [puppet] - 10https://gerrit.wikimedia.org/r/970718 (owner: 10Muehlenhoff) [14:45:50] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:969966|Growth: Enable new Impact module on all Wikipedias (T336203)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:46:53] !log urbanecm@deploy2002 urbanecm: Continuing with sync [14:47:12] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:11] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) [14:50:26] (03PS2) 10Urbanecm: Growth: Disable new impact A/B testing on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969967 (https://phabricator.wikimedia.org/T336203) [14:50:32] (03CR) 10Urbanecm: [C: 03+2] Growth: Disable new impact A/B testing on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969967 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [14:52:11] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:969966|Growth: Enable new Impact module on all Wikipedias (T336203)]] (duration: 10m 41s) [14:52:14] T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203 [14:52:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/969967 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [14:52:29] (03Merged) 10jenkins-bot: Growth: Disable new impact A/B testing on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969967 (https://phabricator.wikimedia.org/T336203) (owner: 10Urbanecm) [14:52:53] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:969967|Growth: Disable new impact A/B testing on pilot wikis (T336203)]] [14:54:12] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:30] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [14:57:18] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:969967|Growth: Disable new impact A/B testing on pilot wikis (T336203)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:57:24] !log urbanecm@deploy2002 urbanecm: Continuing with sync [14:57:26] T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203 [14:58:25] (03PS1) 10Ottomata: eventgate-analytics - set cpu and mem requests and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/970796 (https://phabricator.wikimedia.org/T347477) [14:59:13] (03PS2) 10Ottomata: eventgate-analytics - set cpu and mem requests and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/970796 (https://phabricator.wikimedia.org/T347477) [14:59:22] !log mwmaint2002: mwscript userOptions.php --wiki=cswiki --nowarn --old='oldimpact' --new='control' 'growthexperiments-homepage-variant' # end A/B testing of new Impact (T336203) [14:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:58] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [15:02:38] !log urbanecm@deploy2002 Finished scap: Backport for 
[[gerrit:969967|Growth: Disable new impact A/B testing on pilot wikis (T336203)]] (duration: 09m 44s) [15:02:44] T336203: Positive reinforcement: Deploy the new Impact module to all Wikipedias - https://phabricator.wikimedia.org/T336203 [15:04:32] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-155326.scope,session-155328.scope,session-155400.scope,session-155401.scope,session-155437.scope,session-155634.scope,session-155642.scope,session-155646.scope,session-155647.scope,session-155665.scope,session-155666.scope,session-c3789.scope,session-c3790.scope,session-c3791.scope,session-c3792.scope,session-c3793.scope,session [15:04:32] cope,session-c3795.scope,session-c3796.scope,session-c3797.scope,session-c3798.scope,session-c3799.scope,session-c3800.scope,session-c3801.scope,session-c3802.scope,session-c3803.scope,session-c734.scope,user-runtime-dir@114.service,user@0.service,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:12] !log mwmaint2002: mwscript userOptions.php --wiki=WIKI --nowarn --old='oldimpact' --new='control' 'growthexperiments-homepage-variant' # end A/B testing of new Impact (T336203; wikis=arwiki bnwiki elwiki eswiki fawiki frwiki frwiktionary idwiki plwiki rowiki trwiki viwiki) [15:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:40] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics - set cpu and mem requests and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/970796 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [15:06:59] (PuppetFailure) firing: Puppet has failed on bast2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:07:54] (03Merged) 10jenkins-bot: eventgate-analytics - set cpu and mem requests and limits [deployment-charts] - 
10https://gerrit.wikimedia.org/r/970796 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [15:10:42] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:44] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [15:12:15] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [15:12:29] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm executed wit... [15:16:22] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:16:24] (03PS1) 10Ottomata: eventgate-analytics - also set cpu and mem requests and limits for canary [deployment-charts] - 10https://gerrit.wikimedia.org/r/970798 (https://phabricator.wikimedia.org/T347477) [15:20:03] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics - also set cpu and mem requests and limits for canary [deployment-charts] - 10https://gerrit.wikimedia.org/r/970798 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [15:21:39] (03Merged) 10jenkins-bot: eventgate-analytics - also set cpu and mem requests and limits for canary [deployment-charts] - 10https://gerrit.wikimedia.org/r/970798 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [15:21:59] 
(PuppetFailure) resolved: Puppet has failed on bast2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:22:52] PROBLEM - Check systemd state on pybal-test2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-check-nft.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:11] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [15:23:25] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [15:25:06] (03CR) 10JMeybohm: [C: 03+2] Update benthos to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/969366 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [15:25:38] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1113.eqiad.wmnet with OS bookworm [15:25:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1113.eqiad.wmnet with OS bookworm [15:26:08] (03Merged) 10jenkins-bot: Update benthos to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/969366 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [15:26:14] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1113.eqiad.wmnet with OS bookworm [15:26:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1113.eqiad.wmnet with OS bookworm executed with errors: - cp1113 (**FAIL**... [15:27:50] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1113.eqiad.wmnet with OS bullseye [15:27:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1113.eqiad.wmnet with OS bullseye [15:28:42] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [15:28:50] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:26] RECOVERY - Check systemd state on pybal-test2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:56] (03PS1) 10Muehlenhoff: cassandra: Avoid Ferm-specific syntax and simplify analytics access [puppet] - 10https://gerrit.wikimedia.org/r/970799 [15:35:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1113.eqiad.wmnet with OS bullseye executed with errors: - cp1113 (**FAIL**... 
[15:36:50] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1113.eqiad.wmnet with OS bullseye [15:37:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1113.eqiad.wmnet with OS bullseye [15:38:02] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/970799 (owner: 10Muehlenhoff) [15:40:34] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) slice from this morning: P53124 (by the way, I'm generating that with ` cumin1001:~$ sudo cumin "P{clou... 
[15:41:22] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:04] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:44] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:06] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:12] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:45] (03PS1) 10Ottomata: eventgate-analytics - revert cpu and mem requests and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/970804 (https://phabricator.wikimedia.org/T347477) [15:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:52:22] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [15:52:54] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics - revert cpu and mem requests and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/970804 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [15:53:43] (03Merged) 10jenkins-bot: eventgate-analytics - revert cpu and mem requests and limits [deployment-charts] - 
10https://gerrit.wikimedia.org/r/970804 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [15:53:50] PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:10] PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one last comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/968288 (owner: 10EoghanGaffney) [15:56:30] RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:40] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [15:56:48] RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:56] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10cmooney) @Jclark-ctr had a look at the NIC riser card wasn't properly seated. After re-seating the card the server connection seems to be working, current... 
[15:57:36] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [15:57:42] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [15:57:54] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [15:57:58] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:22] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cp1113.eqiad.wmnet with OS bullseye [15:59:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1113.eqiad.wmnet with OS bullseye executed with errors: - cp1113 (**FAIL**... 
[15:59:33] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1113.eqiad.wmnet with OS bullseye [15:59:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1113.eqiad.wmnet with OS bullseye [15:59:42] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:50] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install 2 ganeti node expansion - https://phabricator.wikimedia.org/T349926 (10MoritzMuehlenhoff) [16:00:57] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install 2 ganeti node expansion for codfw - https://phabricator.wikimedia.org/T349926 (10MoritzMuehlenhoff) [16:01:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install 2 ganeti node expansion for codfw - https://phabricator.wikimedia.org/T349926 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH I've filled in the details in the task description, let me know if you need anything else. 
[16:01:06] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:27] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [16:01:36] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:02:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (10RobH) [16:03:27] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1008.eqiad.wmnet [16:03:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (10RobH) a:05RobH→03None >>! In T349926#9298843, @MoritzMuehlenhoff wrote: > I've filled in the details in the task description, let me know if you need anything e... [16:04:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti for eqiad - https://phabricator.wikimedia.org/T349925 (10MoritzMuehlenhoff) [16:04:28] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: moving switch link from NIC port 2 to port 1 [16:04:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti for eqiad - https://phabricator.wikimedia.org/T349925 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH I've filled in the details in the task description, let me know if you need anything else. 
[16:04:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: moving switch link from NIC port 2 to port 1 [16:05:10] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b9bd4e38-25ed-4ed0-bdf7-47bd52027bdc) set by cmooney@cumin1001 for 1:00:00 on 1 host(s) an... [16:06:40] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:02] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:26] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:02] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [16:10:48] (03PS1) 10Jforrester: [Staging only] wikifunctions: Switch Python evaluator to new WASM image [deployment-charts] - 10https://gerrit.wikimedia.org/r/970806 (https://phabricator.wikimedia.org/T349736) [16:11:54] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:28] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:43] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1008.eqiad.wmnet [16:15:20] 
!log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage [16:18:17] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [16:18:26] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage [16:18:50] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [16:19:42] jouncebot: nowandnext [16:19:42] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [16:19:42] In 0 hour(s) and 40 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T1700) [16:19:53] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Switch Python evaluator to new WASM image [deployment-charts] - 10https://gerrit.wikimedia.org/r/970806 (https://phabricator.wikimedia.org/T349736) (owner: 10Jforrester) [16:20:46] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Switch Python evaluator to new WASM image [deployment-charts] - 10https://gerrit.wikimedia.org/r/970806 (https://phabricator.wikimedia.org/T349736) (owner: 10Jforrester) [16:21:12] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:22:01] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:25:12] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [16:25:31] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [16:26:04] (03PS1) 10Jforrester: wikifunctions: Switch Python evaluator to new, WASM-based image [deployment-charts] - 10https://gerrit.wikimedia.org/r/970808 (https://phabricator.wikimedia.org/T349736) [16:26:20] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Switch Python evaluator to new, WASM-based image [deployment-charts] - 
10https://gerrit.wikimedia.org/r/970808 (https://phabricator.wikimedia.org/T349736) (owner: 10Jforrester) [16:27:17] (03Merged) 10jenkins-bot: wikifunctions: Switch Python evaluator to new, WASM-based image [deployment-charts] - 10https://gerrit.wikimedia.org/r/970808 (https://phabricator.wikimedia.org/T349736) (owner: 10Jforrester) [16:28:38] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:28:42] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:28:57] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:28:57] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:34] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:35:08] (03PS1) 10Jforrester: [Staging only] wikifunctions: Drop special Python staging evaluator values [deployment-charts] - 10https://gerrit.wikimedia.org/r/970809 [16:35:44] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Drop special Python staging evaluator values [deployment-charts] - 10https://gerrit.wikimedia.org/r/970809 (owner: 10Jforrester) [16:36:34] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1113.eqiad.wmnet with OS bullseye [16:36:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1113.eqiad.wmnet with OS bullseye completed: - cp1113 (**PASS**) - Remov... 
[16:36:42] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Drop special Python staging evaluator values [deployment-charts] - 10https://gerrit.wikimedia.org/r/970809 (owner: 10Jforrester) [16:39:30] (03PS2) 10Herron: logstash: add uri_host field to w3creportingapi template [puppet] - 10https://gerrit.wikimedia.org/r/969948 (https://phabricator.wikimedia.org/T349807) [16:40:30] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:40:58] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:41:16] (03CR) 10Herron: logstash: add uri_host field to w3creportingapi template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969948 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [16:49:28] (03PS1) 10Ottomata: eventgate-analytics-external - upgrade to nodejs 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/970811 (https://phabricator.wikimedia.org/T347477) [16:50:48] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics-external - upgrade to nodejs 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/970811 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [16:51:18] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye [16:51:20] (03PS1) 10Majavah: hieradata: update cloudvirt-wdqs1002 network config [puppet] - 10https://gerrit.wikimedia.org/r/970812 [16:51:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1114.eqiad.wmnet with OS bullseye [16:51:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10BCornwall) [16:51:47] (03Merged) 10jenkins-bot: eventgate-analytics-external - upgrade to nodejs 18 [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/970811 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [16:51:56] (03CR) 10Majavah: [C: 03+2] hieradata: update cloudvirt-wdqs1002 network config [puppet] - 10https://gerrit.wikimedia.org/r/970812 (owner: 10Majavah) [16:52:32] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudvirt-wdqs1002 - taavi@cumin1001" [16:53:31] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [16:53:33] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudvirt-wdqs1002 - taavi@cumin1001" [16:54:10] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on lsw1-f1-eqiad.mgmt,ssw1-e1-eqiad.mgmt with reason: replacing optics to troubleshoot errors on core switch link [16:54:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lsw1-f1-eqiad.mgmt,ssw1-e1-eqiad.mgmt with reason: replacing optics to troubleshoot errors on core switch link [16:54:30] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=43516157-a1a8-45c7-82a5-d013fe5b4dda) set by cmooney@cumin1001 for 2:00:00 on 2 host(s) and their services with reason: replacing optics to troubleshoot errors on... 
[16:56:07] (03PS1) 10Ottomata: eventgate-analytics-external - use 127.0.0.1 in urls instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/970813 (https://phabricator.wikimedia.org/T347477) [16:56:18] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt-wdqs1002.eqiad.wmnet with reason: host reimage [16:56:34] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics-external - use 127.0.0.1 in urls instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/970813 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [16:57:36] (03Merged) 10jenkins-bot: eventgate-analytics-external - use 127.0.0.1 in urls instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/970813 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [16:57:56] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [16:58:25] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [16:58:44] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [16:58:47] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt-wdqs1002.eqiad.wmnet with reason: host reimage [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T1700) [17:00:29] (03PS1) 10Ottomata: eventgate-analytics-external - fix typo in schema uris [deployment-charts] - 10https://gerrit.wikimedia.org/r/970814 (https://phabricator.wikimedia.org/T347477) [17:00:45] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics-external - fix typo in schema uris 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/970814 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [17:01:39] (03Merged) 10jenkins-bot: eventgate-analytics-external - fix typo in schema uris [deployment-charts] - 10https://gerrit.wikimedia.org/r/970814 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [17:01:47] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cp1114.eqiad.wmnet with OS bullseye [17:01:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1114.eqiad.wmnet with OS bullseye executed with errors: - cp1114 (**FAIL**... [17:02:08] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye [17:02:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1114.eqiad.wmnet with OS bullseye [17:02:27] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [17:02:39] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [17:02:42] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [17:03:03] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:51] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [17:05:17] !log otto@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/eventgate-analytics-external: apply [17:08:41] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10cmooney) Optic in ssw1-e1-eqiad et-0/0/8 was replaced, new one now working we should keep an eye and see if it fires again. [17:12:40] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1114.eqiad.wmnet with OS bullseye [17:12:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1114.eqiad.wmnet with OS bullseye executed with errors: - cp1114 (**FAIL**... [17:12:51] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye [17:13:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1114.eqiad.wmnet with OS bullseye [17:14:30] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [17:14:49] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [17:19:51] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1001" [17:20:40] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1001" [17:20:46] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [17:20:57] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - 
https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm completed: -... [17:21:23] !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on cloudvirt-wdqs1002.eqiad.wmnet with reason: still setting up [17:21:36] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on cloudvirt-wdqs1002.eqiad.wmnet with reason: still setting up [17:21:52] !log taavi@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1002.eqiad.wmnet [17:25:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [17:27:49] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:27:52] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt-wdqs1002.eqiad.wmnet [17:28:07] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage [17:28:31] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [17:29:18] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1131.eqiad.wmnet - https://phabricator.wikimedia.org/T350141 (10Jclark-ctr) [17:29:32] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1131.eqiad.wmnet - https://phabricator.wikimedia.org/T350141 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [17:29:34] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - 
https://phabricator.wikimedia.org/T308339 (10Jclark-ctr) [17:30:14] (03CR) 10Cathal Mooney: "Good points, I'll split it up / refactor and submit again." [puppet] - 10https://gerrit.wikimedia.org/r/970341 (https://phabricator.wikimedia.org/T350132) (owner: 10Cathal Mooney) [17:30:27] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Jclark-ctr) [17:30:41] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (10Jclark-ctr) [17:30:47] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Jclark-ctr) 05Open→03Resolved [17:30:55] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10Jclark-ctr) [17:31:05] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Jclark-ctr) [17:31:22] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:46] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T349954 (10Jclark-ctr) a:03Jclark-ctr [17:32:29] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage [17:35:11] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T349954 (10Jclark-ctr) Rebalanced power between branches [17:35:17] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T349954 (10Jclark-ctr) 05Open→03Resolved [17:39:22] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 
(10taavi) I believe this is all done. Thank you everyone! [17:40:50] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10taavi) 05Open→03Resolved [17:40:56] 10SRE, 10ops-eqiad, 10Goal, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10taavi) [17:41:04] !log taavi@cumin1001 START - Cookbook sre.hosts.remove-downtime for cloudvirt-wdqs1002.eqiad.wmnet [17:41:04] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cloudvirt-wdqs1002.eqiad.wmnet [17:41:27] (03PS1) 10Ssingh: Release dnsdist 1.8.2-1+wmf12u2 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/970819 [17:50:36] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1114.eqiad.wmnet with OS bullseye [17:50:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1114.eqiad.wmnet with OS bullseye completed: - cp1114 (**PASS**) - Remov... [17:58:12] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10ATsay-WMF) I just needed to see the superset dashboard, and it's working now--thank you! [17:58:22] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:05] dduvall and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T1800). [18:00:06] dduvall and dancy: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. 
May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T1800). [18:01:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10BCornwall) [18:01:58] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [18:02:03] (03PS1) 10Hnowlan: rest-gateway: add device-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/970823 [18:02:26] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye [18:02:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1115.eqiad.wmnet with OS bullseye [18:02:36] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:47] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970825 (https://phabricator.wikimedia.org/T348356) [18:02:49] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970825 (https://phabricator.wikimedia.org/T348356) (owner: 10TrainBranchBot) [18:03:48] (03CR) 10Ssingh: [C: 03+2] Release dnsdist 1.8.2-1+wmf12u2 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/970819 (owner: 10Ssingh) [18:03:52] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970825 (https://phabricator.wikimedia.org/T348356) (owner: 10TrainBranchBot) [18:10:17] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 
1.42.0-wmf.3 refs T348356 [18:10:22] T348356: 1.42.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T348356 [18:14:51] !log reprepro -C component/dnsdist include bookworm-wikimedia dnsdist_1.8.2-1+wmf12u2_amd64.changes [18:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:57] !log dduvall@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.3 refs T348356 (duration: 05m 39s) [18:16:03] T348356: 1.42.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T348356 [18:16:06] !log upgrade doh4001 to dnsdist 1.8.2-1+wmf12u2 [18:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:10] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye [18:17:13] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye [18:17:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL**... 
[18:17:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye [18:27:58] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:42] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [18:31:25] (03CR) 10Kosta Harlan: ipoid: Update cronjob definition (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [18:34:53] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [18:35:37] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:38:24] (03CR) 10Cwhite: [C: 03+1] "Looks good! Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/969948 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [18:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:55:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*... [18:56:38] (03PS1) 10Eevans: cassandra: add grants for new mobileapps tables [puppet] - 10https://gerrit.wikimedia.org/r/970848 (https://phabricator.wikimedia.org/T348993) [18:58:19] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:41] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [19:01:56] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10BCornwall) [19:08:19] (03PS3) 10Herron: logstash: add uri_host field to w3creportingapi template [puppet] - 10https://gerrit.wikimedia.org/r/969948 (https://phabricator.wikimedia.org/T349807) [19:09:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:09:53] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:11:07] (03CR) 10Herron: [C: 03+2] "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/969948 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [19:17:14] (03PS3) 10Cathal Mooney: Adjust reimage cookbook config for DHCP binding clear workaround [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) [19:19:14] (03CR) 10Dzahn: [C: 03+1] "I haven't been around so these changes are somewhat new to me but it seems like a standard thing for you now and with RT I am not concerne" [puppet] - 10https://gerrit.wikimedia.org/r/970724 (owner: 10Muehlenhoff) [19:20:04] (03CR) 10Dzahn: [C: 03+1] "Also happy to merge it and learn myself how that kind of change looks in puppet, just need to fix my prod shell access still." [puppet] - 10https://gerrit.wikimedia.org/r/970724 (owner: 10Muehlenhoff) [19:20:39] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:57] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [19:22:30] (03CR) 10Cathal Mooney: [C: 03+2] Change core router config to export internal routes to Switches (034 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547) (owner: 10Cathal Mooney) [19:23:09] (03Merged) 10jenkins-bot: Change core router config to export internal routes to Switches [homer/public] - 10https://gerrit.wikimedia.org/r/969367 (https://phabricator.wikimedia.org/T344547) (owner: 10Cathal Mooney) [19:25:11] (03PS1) 10Majavah: users: add network device access for taavi [homer/public] - 10https://gerrit.wikimedia.org/r/970850 (https://phabricator.wikimedia.org/T350267) [19:26:22] (03CR) 
10Majavah: users: add network device access for taavi (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/970850 (https://phabricator.wikimedia.org/T350267) (owner: 10Majavah) [19:44:52] !log adjusting routes announced to L3 switches in codfw T344547 [19:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:56] T344547: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 [19:48:00] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T350095 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [19:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:57:27] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [19:58:37] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T2000). [20:00:05] MdsShakil: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:38] Hello :) [20:01:27] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [20:02:49] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:05] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [20:04:40] hi - i can deploy [20:05:08] (03PS5) 10Clare Ming: Create Draft namespace on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970750 (https://phabricator.wikimedia.org/T350133) (owner: 10MdsShakil) [20:06:18] 10SRE, 10ops-codfw, 10Cassandra, 10decommission-hardware, 10Patch-For-Review: decommission restbase2012.codfw.wmnet - https://phabricator.wikimedia.org/T349526 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [20:06:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970750 (https://phabricator.wikimedia.org/T350133) (owner: 10MdsShakil) [20:06:33] (03CR) 10Muehlenhoff: RT: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/970724 (owner: 10Muehlenhoff) [20:07:03] (03Merged) 10jenkins-bot: Create Draft namespace on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970750 (https://phabricator.wikimedia.org/T350133) (owner: 10MdsShakil) [20:07:29] !log cjming@deploy2002 Started scap: Backport for [[gerrit:970750|Create Draft namespace on bnwiki (T350133)]] [20:07:35] T350133: Create "Draft" namespace on bnwiki - https://phabricator.wikimedia.org/T350133 [20:08:47] !log cjming@deploy2002 mdsshakil and cjming: Backport for 
[[gerrit:970750|Create Draft namespace on bnwiki (T350133)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:08:49] MdsShakil: can you test? [20:08:56] doing [20:10:54] cjming is it synced yet? [20:11:10] it's on the test servers - can you test there? [20:11:53] can your changes be tested on a testserver? [20:12:31] i can sync once you verify [20:14:22] I am trying to test via special:prefix but there is no change. [20:15:27] Ok, now got it [20:15:43] cjming looks good to me [20:15:52] great - syncing then [20:15:56] !log cjming@deploy2002 mdsshakil and cjming: Continuing with sync [20:21:07] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:970750|Create Draft namespace on bnwiki (T350133)]] (duration: 13m 38s) [20:21:11] T350133: Create "Draft" namespace on bnwiki - https://phabricator.wikimedia.org/T350133 [20:21:29] MdsShakil: should be live! [20:22:30] I have to run so I'm going to close the window since there aren't any more patches in the queue [20:23:16] !log adjusting routes announced to L3 switches in esams T344547 [20:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:20] T344547: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 [20:23:32] !log end of UTC late backport window [20:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:57] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:28:13] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [20:28:50] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10Patch-For-Review: Create Generalised blocking strategy - https://phabricator.wikimedia.org/T270618 (10BCornwall) Thank you for all the work on this ticket
and for creating it. I notice that this is a very broad topic and think it would be... [20:32:23] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [20:33:31] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:39:36] 10SRE, 10Traffic-Icebox, 10observability: prometheus-trafficserver-exporter: InsecureRequestWarning - https://phabricator.wikimedia.org/T252993 (10BCornwall) 05Open→03Declined This warning is no longer applicable since we've moved away from ats-tls. For good measure I verified that the error message... [20:44:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 47.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:48:45] I think we need to run namespaceDupes.php [20:49:01] Anyone around? [20:49:10] MdsShakil: did you add a new namespace?
[20:49:32] See above T350133 [20:49:32] T350133: Create "Draft" namespace on bnwiki - https://phabricator.wikimedia.org/T350133 [20:49:41] Yes you should have [20:49:53] !log configure esams switches to load-share default across CRs T344547 [20:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:57] T344547: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 [20:50:02] taavi, urbanecm, TheresNoTime: ^ [20:50:12] what's happening? [20:50:33] urbanecm: can you run namespaceDupes on wiki [20:50:37] urbanecm We need to run namespaceDupes.php on bnwiki [20:50:37] It got forgotten during B&C [20:50:39] which one? [20:50:42] okay, bnwiki [20:50:47] Bnwiki ye [20:50:59] Autocorrect took BN off - stupid [20:51:17] !log mwmaint2002: mwscript namespaceDupes.php bnwiki --fix --add-prefix BROKEN [20:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:45] urbanecm: can you paste output? [20:51:56] absolutely https://www.irccloud.com/pastebin/GMpTEZl9/ [20:52:42] MdsShakil: can you look into the page at the end [20:53:12] Line 32 of the paste [20:53:18] urbanecm: you're amazing [20:53:30] fortunately, this is a standard operation :) [20:53:54] urbanecm: you're still amazing [20:54:02] thank you ❤️ [20:55:12] RhinosF1 do I need to do something here?
[20:55:39] (03PS1) 10Cathal Mooney: Enable multipath on L3 switches core BGP group facing CRs [homer/public] - 10https://gerrit.wikimedia.org/r/970860 (https://phabricator.wikimedia.org/T312635) [20:56:04] MdsShakil: move/delete the page mentioned on line 32 to an actual title [20:56:35] (03CR) 10Cathal Mooney: [C: 03+2] Enable multipath on L3 switches core BGP group facing CRs [homer/public] - 10https://gerrit.wikimedia.org/r/970860 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [20:57:11] (03Merged) 10jenkins-bot: Enable multipath on L3 switches core BGP group facing CRs [homer/public] - 10https://gerrit.wikimedia.org/r/970860 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [20:58:41] MdsShakil: I see you have done, have a good evening [20:58:43] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:58:57] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T2100) [21:02:55] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:03:09] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [21:14:43] !log adjust BGP policy out to L3 switches on remaining CRs T344547 [21:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:48] T344547: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 [21:25:58] (RdfStreamingUpdaterSpaceUsageTooHigh) 
firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [21:28:05] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:19] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [21:32:31] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [21:33:39] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:50:24] (03PS7) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) [21:52:24] (03PS8) 10Ryan Kemper: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) (owner: 10Bking) [21:54:28] jouncebot: nowandnext [21:54:28] For the next 0 hour(s) and 5 minute(s): Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231101T2100) [21:54:29] In 8 hour(s) and 5 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T0600) [21:54:29] In 8 hour(s) and 5 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T0600) [21:54:52] (03CR) 10Urbanecm: [C: 03+2] Add খসড়া as draft namespace alias on bnwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/970757 (owner: 10MdsShakil) [21:55:44] (03Merged) 10jenkins-bot: Add খসড়া as draft namespace alias on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970757 (owner: 10MdsShakil) [21:56:28] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:970757|Add খসড়া as draft namespace alias on bnwiki]] [21:57:22] (03CR) 10CI reject: [V: 04-1] wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) (owner: 10Bking) [21:57:32] (03PS9) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) [21:57:43] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [21:57:55] !log urbanecm@deploy2002 mdsshakil and urbanecm: Backport for [[gerrit:970757|Add খসড়া as draft namespace alias on bnwiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:00:39] !log urbanecm@deploy2002 mdsshakil and urbanecm: Continuing with sync [22:01:57] (03PS10) 10Ryan Kemper: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) (owner: 10Bking) [22:03:17] 10SRE, 10Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) 05Open→03Resolved a:03cmooney Patch is merged everywhere. Looks ok. For instance switch in esams connected to backup LVS now sends traffic to pr... 
[22:05:53] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [22:06:02] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:970757|Add খসড়া as draft namespace alias on bnwiki]] (duration: 09m 34s) [22:06:12] (03CR) 10CI reject: [V: 04-1] wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) (owner: 10Bking) [22:09:16] (03PS11) 10Ryan Kemper: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) (owner: 10Bking) [22:09:44] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Add new codfw private vlan sub-interfaces to lvs2013 and lvs2014 - https://phabricator.wikimedia.org/T348225 (10cmooney) a:03cmooney [22:13:59] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [22:14:01] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [22:15:35] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [22:15:37] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [22:16:40] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [22:18:06] !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [22:18:10] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [22:18:18] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [22:18:55] (03CR) 10Ryan Kemper: [C: 03+1] wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) (owner: 10Bking) [22:18:59] (03CR) 10Bking: [C: 03+2] wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) (owner: 10Bking) 
[22:21:21] (03PS1) 10MdsShakil: Revert "Add খসড়া as draft namespace alias on bnwiki" and add "খসড়া" by copy-paste from wiki page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970758 [22:24:30] (03PS2) 10MdsShakil: Revert "Add খসড়া as draft namespace alias on bnwiki" and add "খসড়া" by copy-paste from wiki page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970758 [22:26:11] (03PS3) 10MdsShakil: Revert "Add খসড়া as draft namespace alias on bnwiki" and add "খসড়া" by copy-paste from wiki page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970758 [22:27:07] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [22:27:38] (03CR) 10Urbanecm: [C: 03+2] Revert "Add খসড়া as draft namespace alias on bnwiki" and add "খসড়া" by copy-paste from wiki page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970758 (owner: 10MdsShakil) [22:28:19] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:28:20] (03Merged) 10jenkins-bot: Revert "Add খসড়া as draft namespace alias on bnwiki" and add "খসড়া" by copy-paste from wiki page [mediawiki-config] - 10https://gerrit.wikimedia.org/r/970758 (owner: 10MdsShakil) [22:32:24] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:970758|Revert "Add খসড়া as draft namespace alias on bnwiki" and add "খসড়া" by copy-paste from wiki page]] [22:33:53] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:34:03] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [22:38:22] (03PS1) 10BCornwall: apache: Redirect sco Wiktionary 
to sco Wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/970877 (https://phabricator.wikimedia.org/T249648) [22:39:08] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:970758|Revert "Add খসড়া as draft namespace alias on bnwiki" and add "খসড়া" by copy-paste from wiki page]] (duration: 06m 44s) [22:40:30] (03CR) 10CI reject: [V: 04-1] apache: Redirect sco Wiktionary to sco Wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/970877 (https://phabricator.wikimedia.org/T249648) (owner: 10BCornwall) [22:41:26] (03PS2) 10BCornwall: apache: Redirect sco Wiktionary to sco Wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/970877 (https://phabricator.wikimedia.org/T249648) [22:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:43:35] (03CR) 10CI reject: [V: 04-1] apache: Redirect sco Wiktionary to sco Wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/970877 (https://phabricator.wikimedia.org/T249648) (owner: 10BCornwall) [22:50:34] (03PS3) 10BCornwall: apache: Redirect sco Wiktionary to sco Wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/970877 (https://phabricator.wikimedia.org/T249648) [22:57:43] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:57:51] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [22:59:08] 10SRE, 10Traffic-Icebox: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938 (10BCornwall) 05Open→03Resolved a:03BCornwall Since the patch has long since been merged and 
we're well upgraded, assuming this is fixed. [23:03:19] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:03:29] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [23:08:22] 10SRE, 10Traffic-Icebox: switch to irate() instead of rate() for traffic graphs - https://phabricator.wikimedia.org/T246902 (10BCornwall) 05Open→03Invalid Thanks for creating this ticket. Unfortunately, this is unable to be serviced due to insufficient information. Please do re-open if it can be given. Tha... [23:15:59] 10SRE, 10Traffic-Icebox: ats-be: consider increasing accept threads or moving accept from dedicated thread to workers - https://phabricator.wikimedia.org/T241233 (10BCornwall) 05Open→03Declined Considering this is four years old and hasn't been visited, this isn't worth keeping around. If we notice such a... [23:17:16] 10SRE, 10Traffic-Icebox: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 (10BCornwall) 05Open→03Invalid Closing due to age and lack of response. If this is still occurring let's reopen. [23:17:57] 10SRE, 10Traffic-Icebox: Servers freezing across the caching cluster - https://phabricator.wikimedia.org/T238305 (10BCornwall) 05Open→03Stalled @Vgutierrez I'm guessing this is no longer an issue? [23:19:43] 10SRE, 10Traffic-Icebox: Depooling single text caching server in esams had a disproportionate performance impact - https://phabricator.wikimedia.org/T238085 (10BCornwall) 05Open→03Invalid Closing due to age and apparent non-recurrence since then. Please re-open if this is still an issue. Thanks! 
[23:25:58] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE-swift-storage, 10Commons, 10Data-Persistence, and 5 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10BCornwall) [23:27:05] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:27:15] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [23:27:47] 10SRE, 10Traffic-Icebox: Unified certs bloat reduction? - https://phabricator.wikimedia.org/T183554 (10BCornwall) @BBlack Is this something we still want to pursue? I'd fight for moving this discussion to e.g. email or some other discussion medium where there could be some engagement and leaving actionables fo... [23:30:19] 10SRE, 10Traffic-Icebox, 10User-MoritzMuehlenhoff: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742 (10BCornwall) 05Open→03Invalid Since we're using systemd's timesyncd nowadays, so this isn't relevant any more. If I'm wrong, please do re-open. Thanks! [23:32:08] 10SRE, 10Traffic-Icebox, 10observability: prometheus -> grafana stats for per-numa-node meminfo - https://phabricator.wikimedia.org/T175636 (10BCornwall) 05Open→03Resolved a:03BCornwall [23:32:43] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:32:53] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [23:39:17] 10SRE, 10RESTBase-API, 10Traffic-Icebox: Thumb API: Varnish / CDN questions - https://phabricator.wikimedia.org/T150673 (10BCornwall) 05Open→03Invalid Closing as this was not actionable. 
For the future, please contact the team at https://wikitech.wikimedia.org/wiki/SRE/Traffic for any questions. Thanks! [23:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:57:59] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:58:09] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV