[00:00:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:14] (PS19) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[00:03:41] (CR) CI reject: [V: -1] Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[00:04:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:23] (PS20) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[00:07:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:15:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:29] (CR) Dwisehaupt: "I think this is ready for another round of review." [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[00:19:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service,produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:03] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/967923
[00:39:09] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/967923 (owner: TrainBranchBot)
[00:48:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[00:50:21] SRE: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671 (Jasper) On further investigation, I think the cache has gone bad or something, because when I switch to wifi (same ISP, but different IPv6 addresses), it seems to work properly. For now that will be my workaround, but th...
[00:58:52] SRE: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671 (Jasper) This would seem to make sense based on https://wikitech.wikimedia.org/wiki/Varnish#Force_your_requests_through_a_specific_Varnish_frontend which says that the hash of my IP is the caching key
[01:01:30] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/967923 (owner: TrainBranchBot)
[01:05:21] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:06:55] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[02:26:07] RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:01] (CR) Andrew Bogott: "I'm pretty sure the linter is misfiring now -- see https://gerrit.wikimedia.org/r/c/operations/software/cumin/+/968762 for demonstration" [software/cumin] - https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: Andrew Bogott)
[02:31:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:38:41] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:45:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:50:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:53:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:04:31] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:34:47] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.015e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[03:36:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:51:54] (PS4) Tim Starling: Enable LoginNotify seen subnets table [mediawiki-config] - https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989)
[03:52:06] (CR) Tim Starling: Enable LoginNotify seen subnets table (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[04:07:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:15:45] PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:39:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:47:13] (PS2) KartikMistry: testwiki: Enable Section translation on some Wikipedias with potential to be supported with MinT [mediawiki-config] - https://gerrit.wikimedia.org/r/968649 (https://phabricator.wikimedia.org/T345267)
[04:48:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[05:05:21] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:06:55] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:12:17] RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:19:53] (PS3) Zoranzoki21: Add throttle rule for Edit-a-Thon on 2023-11-03 [mediawiki-config] - https://gerrit.wikimedia.org/r/968737 (https://phabricator.wikimedia.org/T349234)
[05:20:22] (CR) Zoranzoki21: Add throttle rule for Edit-a-Thon on 2023-11-03 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/968737 (https://phabricator.wikimedia.org/T349234) (owner: Zoranzoki21)
[05:20:25] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[05:22:50] (CR) Anzx: [C: +1] Add throttle rule for Edit-a-Thon on 2023-11-03 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/968737 (https://phabricator.wikimedia.org/T349234) (owner: Zoranzoki21)
[05:39:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:49:27] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:51:17] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:57:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T0600)
[06:00:05] kormat, marostegui, and Amir1: That opportune time is upon us again. Time for a Primary database switchover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T0600).
[06:08:18] (CR) Ladsgroup: [C: +1] "Thank you!" [mediawiki-config] - https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[06:18:31] (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[06:22:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:24:03] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:24:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:30:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.301 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:31:15] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:34:43] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:35:19] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:35:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:36:31] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:36:49] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50715 bytes in 7.758 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:39:35] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Update httpd images to pick up the change in glogger [docker-images/production-images] - https://gerrit.wikimedia.org/r/968620 (owner: Giuseppe Lavagetto)
[06:40:51] <_joe_> !log rebuilding the base httpd image for mediawiki to pick up glogger changes
[06:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:03] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:43:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:45:05] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:47:23] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:52:45] !log installing openssl security updates
[06:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:19] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:53:39] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:57:04] (CR) Jelto: [C: +2] miscweb: remove the use of :latest image tag in httpd exporter [deployment-charts] - https://gerrit.wikimedia.org/r/967147 (https://phabricator.wikimedia.org/T348856) (owner: Jelto)
[06:58:02] (Merged) jenkins-bot: miscweb: remove the use of :latest image tag in httpd exporter [deployment-charts] - https://gerrit.wikimedia.org/r/967147 (https://phabricator.wikimedia.org/T348856) (owner: Jelto)
[07:00:05] Amir1, apergos, and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T0700).
[07:00:05] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:11] morning! There are no trainees signed up to learn the joys of MW deployment today, their loss! But we do have one patch owner with a patch scheduled. kart_ I presume you will be self deploying as usual?
[07:02:42] Thanks apergos :)
[07:03:00] er, that was a question :-D
[07:03:43] ah :)
[07:03:45] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/954656 (https://phabricator.wikimedia.org/T345561) (owner: Alexandros Kosiaris)
[07:03:52] apergos: Yes. Self deploying..
[07:04:00] okey dokey!
[07:04:22] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/968649 (https://phabricator.wikimedia.org/T345267) (owner: KartikMistry)
[07:05:08] (Merged) jenkins-bot: testwiki: Enable Section translation on some Wikipedias with potential to be supported with MinT [mediawiki-config] - https://gerrit.wikimedia.org/r/968649 (https://phabricator.wikimedia.org/T345267) (owner: KartikMistry)
[07:06:09] !log kartik@deploy2002 Started scap: Backport for [[gerrit:968649|testwiki: Enable Section translation on some Wikipedias with potential to be supported with MinT (T345267)]]
[07:06:15] T345267: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT - https://phabricator.wikimedia.org/T345267
[07:08:01] !log kartik@deploy2002 kartik: Backport for [[gerrit:968649|testwiki: Enable Section translation on some Wikipedias with potential to be supported with MinT (T345267)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:13:44] !log kartik@deploy2002 kartik: Continuing with sync
[07:13:51] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:14:49] (PS1) Jelto: miscweb: remove typo in httpd exporter image name [deployment-charts] - https://gerrit.wikimedia.org/r/968946 (https://phabricator.wikimedia.org/T348856)
[07:17:00] (CR) JMeybohm: [C: +1] miscweb: remove typo in httpd exporter image name [deployment-charts] - https://gerrit.wikimedia.org/r/968946 (https://phabricator.wikimedia.org/T348856) (owner: Jelto)
[07:17:49] (CR) Jelto: [C: +2] miscweb: remove typo in httpd exporter image name [deployment-charts] - https://gerrit.wikimedia.org/r/968946 (https://phabricator.wikimedia.org/T348856) (owner: Jelto)
[07:18:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:18:46] (Merged) jenkins-bot: miscweb: remove typo in httpd exporter image name [deployment-charts] - https://gerrit.wikimedia.org/r/968946 (https://phabricator.wikimedia.org/T348856) (owner: Jelto)
[07:19:21] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:968649|testwiki: Enable Section translation on some Wikipedias with potential to be supported with MinT (T345267)]] (duration: 13m 11s)
[07:19:26] T345267: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT - https://phabricator.wikimedia.org/T345267
[07:20:57] apergos: I'm done with my patch.
[07:21:28] you were it, no one else has snuck something in at the last minute, so I guess we can declare the window closed.
[07:21:48] !log UTC morning backport and config window closed
[07:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:55] (PS1) Muehlenhoff: Add config-master Cumin alias [puppet] - https://gerrit.wikimedia.org/r/968950
[07:22:02] have a nice day, see everyone next time!
[07:22:29] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:23:05] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[07:23:13] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.430 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:24:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:25:09] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[07:26:36] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[07:31:36] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[07:32:44] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[07:34:45] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:34:56] SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff)
[07:36:22] (PS1) Muehlenhoff: Add Cumin alias for zookeeper/test [puppet] - https://gerrit.wikimedia.org/r/968953
[07:36:23] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[07:37:29] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.285 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:43:41] SRE, serviceops: Memcached, mcrouter in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (Joe)
[07:46:39] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:47:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:47:47] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:49:03] SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (Joe)
[07:49:03] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:49:21] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50715 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:49:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.707 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:53:38] SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (Joe) p:Triage→High
[07:53:52] (PS1) Giuseppe Lavagetto: mw-jobrunner: add virtualhost explicitly for jobrunning [deployment-charts] - https://gerrit.wikimedia.org/r/968955 (https://phabricator.wikimedia.org/T349796)
[07:54:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15133
[07:55:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15133
[07:56:48] (CR) Muehlenhoff: [C: +2] Add Tyler for approval of various release groups [puppet] - https://gerrit.wikimedia.org/r/967899 (https://phabricator.wikimedia.org/T276465) (owner: Muehlenhoff)
[07:58:06] SRE, Infrastructure Security, Infrastructure-Foundations, Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465 (MoritzMuehlenhoff)
[07:59:14] (CR) Filippo Giunchedi: [C: +2] alertmanager: sanitise silence audit log [puppet] - https://gerrit.wikimedia.org/r/968615 (https://phabricator.wikimedia.org/T321579) (owner: Filippo Giunchedi)
[08:00:39] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:02:52] !log restart prometheus k8s k8s-aux - T343529
[08:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:01] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529
[08:06:21] (PS1) Clément Goubert: mw-debug: Revert envoy draining tests [deployment-charts] - https://gerrit.wikimedia.org/r/968959 (https://phabricator.wikimedia.org/T331609)
[08:06:27] (PS1) Filippo Giunchedi: team-sre: move NodeTextfileStale to warning and per-team [alerts] - https://gerrit.wikimedia.org/r/968960
[08:07:18] !log brouberol@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1008.eqiad.wmnet with OS bullseye
[08:07:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:08:30] (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[08:09:00] (CR) Ilias Sarantopoulos: [C: +2] team-ml: add alert for Kafka consumer lag for ores extension [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[08:09:29] ops-eqiad, Infrastructure-Foundations, netops: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (ayounsi) Open→Stalled
[08:09:48] ops-eqiad, Infrastructure-Foundations, netops: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (ayounsi)
[08:09:54] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (ayounsi)
[08:10:12] (Merged) jenkins-bot: team-ml: add alert for Kafka consumer lag for ores extension [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[08:11:35] SRE: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671 (Fabfur) Targeting cp4039 instance for upload.wikimedia.org seems to work fine when uploading/viewing contents. I'll keep investigating on this...
[08:13:30] (KubernetesAPINotScrapable) resolved: (2) k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[08:15:15] (PS1) Kevin Bazira: ml-services: update recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/968966 (https://phabricator.wikimedia.org/T348607)
[08:21:50] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1008.eqiad.wmnet with reason: host reimage
[08:22:23] jouncebot: nowandnext
[08:22:23] No deployments scheduled for the next 1 hour(s) and 37 minute(s)
[08:22:23] In 1 hour(s) and 37 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T1000)
[08:22:23] In 1 hour(s) and 37 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T1000)
[08:22:44] (CR) Elukey: [C: +1] ml-services: update recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/968966 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[08:24:29] (CR) Filippo Giunchedi: "LGTM! See inline for docs comment. I'm also adding John since this is 'base'" [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[08:24:39] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1008.eqiad.wmnet with reason: host reimage
[08:24:55] (CR) Urbanecm: [C: +2] changeprop: Increase refreshUserImpactJob concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/968636 (https://phabricator.wikimedia.org/T344428) (owner: Urbanecm)
[08:25:47] (Merged) jenkins-bot: changeprop: Increase refreshUserImpactJob concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/968636 (https://phabricator.wikimedia.org/T344428) (owner: Urbanecm)
[08:27:36] !log urbanecm@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[08:27:55] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:28:32] !log urbanecm@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[08:28:41] !log urbanecm@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[08:29:26] !log urbanecm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[08:29:51] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:29:53] (CR) Kevin Bazira: [C: +2] "Thanks for the review :)" [deployment-charts] - https://gerrit.wikimedia.org/r/968966 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[08:29:57] (CR) Filippo Giunchedi: "I like the idea (in fact I have sth similar in modules/admin/files/home/filippo/.bashrc), drive-by comment inline" [puppet] - https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: Ayounsi)
[08:30:41] (Merged) jenkins-bot: ml-services: update recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/968966 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[08:31:38] !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[08:31:39] SRE: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671 (Vgutierrez) we definitely have an increase on 503s against https://commons.wikimedia.org/wiki/Special:Upload, we're seeing issues in at least ulsfo, esams and eqsin: {F40299660}
[08:32:16] urbanecm: o/ if you see anything weird in changeprop staging it is me testing a new version
[08:32:18] !log urbanecm@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[08:32:40] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[08:32:41] elukey: thanks for the info!
[08:33:51] (03CR) 10Elukey: [C: 03+1] cqlsh-instance (new) [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/966913 (owner: 10Eevans) [08:34:08] (03PS3) 10Urbanecm: Growth: Enable new Impact backend everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949034 (https://phabricator.wikimedia.org/T344143) [08:34:20] (03CR) 10Urbanecm: [C: 03+2] Growth: Enable new Impact backend everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949034 (https://phabricator.wikimedia.org/T344143) (owner: 10Urbanecm) [08:35:35] (03Merged) 10jenkins-bot: Growth: Enable new Impact backend everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949034 (https://phabricator.wikimedia.org/T344143) (owner: 10Urbanecm) [08:36:13] (03PS1) 10Jelto: gitlab_runner: Migrate to new runner registration scheme [puppet] - 10https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) [08:36:39] (03CR) 10CI reject: [V: 04-1] gitlab_runner: Migrate to new runner registration scheme [puppet] - 10https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [08:38:22] (03PS8) 10Ayounsi: Add helper functions to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) [08:39:17] (03CR) 10Ayounsi: Add helper functions to setup proxy env var (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [08:39:24] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:949034|Growth: Enable new Impact backend everywhere (T344143)]] [08:39:29] T344143: New Impact module: Run backend updating logic on all Wikipedias - https://phabricator.wikimedia.org/T344143 [08:40:16] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1008.eqiad.wmnet with OS bullseye [08:40:41] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on 
namespace 'recommendation-api-ng' for release 'main' . [08:40:48] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:949034|Growth: Enable new Impact backend everywhere (T344143)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:41:44] (03CR) 10Muehlenhoff: [C: 03+2] idp::memcached Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/967951 (owner: 10Muehlenhoff) [08:43:27] !log urbanecm@deploy2002 urbanecm: Continuing with sync [08:43:45] (03PS5) 10JMeybohm: Enable icu67 component on mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/954659 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [08:43:47] (03PS5) 10JMeybohm: Enable icu67 component on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954657 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [08:43:49] (03PS5) 10JMeybohm: Enable icu67 fleet wide [puppet] - 10https://gerrit.wikimedia.org/r/954661 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [08:43:51] (03PS2) 10JMeybohm: Revert "Enable icu67 component on mwdebug1001" [puppet] - 10https://gerrit.wikimedia.org/r/968660 (https://phabricator.wikimedia.org/T345561) [08:43:53] (03PS1) 10JMeybohm: Enable icu67 component on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/968993 (https://phabricator.wikimedia.org/T345561) [08:45:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:45:33] (03PS1) 10Jelto: gitlab_runner: add token for new authentication scheme [labs/private] - 10https://gerrit.wikimedia.org/r/968996 (https://phabricator.wikimedia.org/T344951) [08:48:21] (03CR) 10Jelto: [V: 03+2 C: 03+2] gitlab_runner: add token for new authentication scheme [labs/private] - 10https://gerrit.wikimedia.org/r/968996 
(https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [08:48:53] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:949034|Growth: Enable new Impact backend everywhere (T344143)]] (duration: 09m 29s) [08:48:58] T344143: New Impact module: Run backend updating logic on all Wikipedias - https://phabricator.wikimedia.org/T344143 [08:49:26] !log mwmaint2002: `foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=1second --verbose --use-job-queue` (testing T344428; after enabling backend on all Wikipedias) [08:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:30] T344428: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 [08:50:33] !log brouberol@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1009.eqiad.wmnet with OS bullseye [08:50:39] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:13] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:53:15] (03CR) 10Muehlenhoff: Enable icu67 component on deployment-prep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968993 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [08:54:16] (03CR) 10Btullis: [C: 03+1] data-engineering: eventgate: standardize alerts [alerts] - 10https://gerrit.wikimedia.org/r/959039 (https://phabricator.wikimedia.org/T326002) (owner: 10Gmodena) [08:54:27] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10serviceops, and 2 others: RefreshUserImpactJob consumes too 
many file descriptors - https://phabricator.wikimedia.org/T344428 (10Urbanecm_WMF) Still no errors. I've increased job concurrency to 10, enabled new Impact backend on all Wi... [08:54:38] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10serviceops, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (10Urbanecm_WMF) a:03Urbanecm_WMF [08:55:12] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:56:10] (03PS5) 10Gmodena: data-engineering: eventgate: standardize alerts [alerts] - 10https://gerrit.wikimedia.org/r/959039 (https://phabricator.wikimedia.org/T326002) [08:56:12] (03CR) 10Ayounsi: [C: 03+2] Add support for SONiC EthernetX named interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/968619 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [08:57:01] (03Merged) 10jenkins-bot: Add support for SONiC EthernetX named interfaces [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/968619 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [08:57:05] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:00:02] (03PS1) 10Ilias Sarantopoulos: ml-services: remove old rr multilingual form staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/968998 (https://phabricator.wikimedia.org/T347551) [09:00:47] (03PS1) 10Ayounsi: Reduce LibreNMS syslog retention to 15 days [puppet] - 10https://gerrit.wikimedia.org/r/968999 (https://phabricator.wikimedia.org/T349362) [09:01:55] (03CR) 10Volans: add domain param to openstack backend (031 comment) [software/cumin] - 
10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [09:02:02] (03CR) 10Volans: "I had already replied about this here:" [software/cumin] - 10https://gerrit.wikimedia.org/r/968762 (owner: 10Andrew Bogott) [09:02:34] (03PS2) 10JMeybohm: Enable icu67 component on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/968993 (https://phabricator.wikimedia.org/T345561) [09:02:36] (03PS6) 10JMeybohm: Enable icu67 component on mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/954659 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [09:02:38] (03PS6) 10JMeybohm: Enable icu67 component on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954657 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [09:02:40] (03PS6) 10JMeybohm: Enable icu67 fleet wide [puppet] - 10https://gerrit.wikimedia.org/r/954661 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [09:02:42] (03PS3) 10JMeybohm: Revert "Enable icu67 component on mwdebug1001" [puppet] - 10https://gerrit.wikimedia.org/r/968660 (https://phabricator.wikimedia.org/T345561) [09:03:15] (03CR) 10JMeybohm: Enable icu67 component on deployment-prep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968993 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [09:03:18] (03PS2) 10Ilias Sarantopoulos: ml-services: remove unused deployments from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/968998 (https://phabricator.wikimedia.org/T347551) [09:03:28] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10serviceops, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (10Urbanecm_WMF) [09:03:38] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1009.eqiad.wmnet with reason: host reimage [09:03:56] 10SRE, 10GrowthExperiments-Homepage, 
10GrowthExperiments-ImpactModule, 10serviceops, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (10Urbanecm_WMF) [09:04:02] (03CR) 10Marostegui: [C: 03+1] Reduce LibreNMS syslog retention to 15 days [puppet] - 10https://gerrit.wikimedia.org/r/968999 (https://phabricator.wikimedia.org/T349362) (owner: 10Ayounsi) [09:05:12] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:05:21] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:06:14] (03PS3) 10JMeybohm: Enable icu67 component on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/968993 (https://phabricator.wikimedia.org/T345561) [09:06:16] (03PS3) 10JMeybohm: Enable icu67 component on mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/968658 (https://phabricator.wikimedia.org/T345561) [09:06:18] (03PS7) 10JMeybohm: Enable icu67 component on mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/954659 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [09:06:20] (03PS7) 10JMeybohm: Enable icu67 component on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954657 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [09:06:22] (03PS7) 10JMeybohm: Enable icu67 fleet wide [puppet] - 10https://gerrit.wikimedia.org/r/954661 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [09:06:24] (03PS4) 10JMeybohm: Revert "Enable icu67 component on mwdebug1001" [puppet] - 10https://gerrit.wikimedia.org/r/968660 
(https://phabricator.wikimedia.org/T345561) [09:06:46] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1009.eqiad.wmnet with reason: host reimage [09:06:55] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:07:55] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10serviceops, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (10Urbanecm_WMF) [09:08:00] (03CR) 10Ayounsi: [C: 03+2] Reduce LibreNMS syslog retention to 15 days [puppet] - 10https://gerrit.wikimedia.org/r/968999 (https://phabricator.wikimedia.org/T349362) (owner: 10Ayounsi) [09:08:17] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10serviceops, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (10Urbanecm_WMF) [09:11:21] (03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [09:11:24] (03CR) 10JMeybohm: [C: 03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/968959 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [09:12:49] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:13] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/968950 (owner: 10Muehlenhoff) [09:13:34] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/968953 (owner: 10Muehlenhoff) [09:14:01] (03CR) 10Volans: [C: 03+1] "I wonder if we should have a zookeper-all or something like that. But I'll leave it up to you." 
[puppet] - 10https://gerrit.wikimedia.org/r/968953 (owner: 10Muehlenhoff) [09:14:05] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [09:14:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [09:15:11] (03PS9) 10Ayounsi: Add helper functions to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) [09:15:19] (03CR) 10Ayounsi: "Thx!" [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [09:15:34] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/196/con" [puppet] - 10https://gerrit.wikimedia.org/r/968658 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [09:16:27] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:18:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "Ship it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [09:18:15] (03CR) 10JMeybohm: [C: 03+2] Enable icu67 component on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/968993 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [09:18:20] (03CR) 10JMeybohm: [C: 03+2] Add a Hiera option to enable ICU67 component [puppet] - 10https://gerrit.wikimedia.org/r/954656 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [09:18:24] (03CR) 10JMeybohm: [C: 03+2] Remove deprecated hiera keys from icu63 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/968616 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [09:18:51] (03CR) 10Elukey: [C: 03+1] ml-services: remove unused deployments from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/968998 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos) [09:19:33] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: remove unused deployments from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/968998 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos) [09:19:44] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (10Marostegui) 05Open→03Resolved a:03Marostegui Table truncated: `root@db1164:/srv/sqldata/librenms# ls -lh syslog.ibd -rw-rw---- 1 mysql mysql 9.0M Oct 26 09:18... 
[09:19:54] (03PS1) 10DCausse: cirrus: disable canary events for update & error streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969064 [09:20:10] (03CR) 10Stevemunene: [C: 03+2] Switch druid1006 zookeeper node with druid1011 [puppet] - 10https://gerrit.wikimedia.org/r/965501 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [09:20:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/968716 (https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [09:20:27] (03Merged) 10jenkins-bot: ml-services: remove unused deployments from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/968998 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos) [09:21:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/968960 (owner: 10Filippo Giunchedi) [09:22:13] PROBLEM - Zookeeper Server on druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [09:23:16] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1009.eqiad.wmnet with OS bullseye [09:24:57] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:26:43] (03CR) 10Jbond: "lgtm see comment" [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [09:28:17] (03CR) 10Jbond: [C: 03+1] Add config-master Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/968950 (owner: 10Muehlenhoff) [09:28:25] !log depooling and restarting blazegraph on wdqs1009 (stuck since 2023-10-12) [09:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:31] (03CR) 10Jbond: [C: 03+1] Add Cumin alias for zookeeper/test [puppet] - 
10https://gerrit.wikimedia.org/r/968953 (owner: 10Muehlenhoff) [09:28:33] (03CR) 10Muehlenhoff: [C: 03+2] Add config-master Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/968950 (owner: 10Muehlenhoff) [09:28:45] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.230 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:29:22] !log erratum (replace wdqs1009 with wdqs2009 in the above msg): depooling and restarting blazegraph on wdqs2009 (stuck since 2023-10-12) [09:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:45] (03PS1) 10Brouberol: Inconditionnally install kafka-kit on kafka brokers are they all run on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/969067 [09:32:17] (03PS2) 10Brouberol: Inconditionnally install kafka-kit on kafka brokers are they all run on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/969067 [09:32:34] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 192, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:33:12] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:33:28] (03PS1) 10Ilias Sarantopoulos: ml-services: remove unused deployments from prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/969068 [09:33:48] (03PS1) 10Muehlenhoff: Remove OS conditional for kafka-kit [puppet] - 10https://gerrit.wikimedia.org/r/969069 [09:33:55] (03CR) 10Btullis: [C: 03+1] Add Cumin alias for zookeeper/test [puppet] - 10https://gerrit.wikimedia.org/r/968953 (owner: 10Muehlenhoff) [09:34:00] (03CR) 10Ayounsi: [C: 03+2] Add helper functions to setup proxy env var [puppet] - 10https://gerrit.wikimedia.org/r/968716 
(https://phabricator.wikimedia.org/T278315) (owner: 10Ayounsi) [09:34:15] (03CR) 10CI reject: [V: 04-1] ml-services: remove unused deployments from prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/969068 (owner: 10Ilias Sarantopoulos) [09:34:21] (03CR) 10Filippo Giunchedi: [C: 03+2] team-sre: move NodeTextfileStale to warning and per-team [alerts] - 10https://gerrit.wikimedia.org/r/968960 (owner: 10Filippo Giunchedi) [09:34:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:37:35] ^ expected, silenced the alert for 24h time to catchup [09:39:05] (03CR) 10Filippo Giunchedi: prometheus: Add a default rsyslog destination for all sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [09:39:13] (03PS2) 10Ilias Sarantopoulos: ml-services: remove unused deployments from prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/969068 [09:39:56] (03CR) 10Muehlenhoff: Add Cumin alias for zookeeper/test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968953 (owner: 10Muehlenhoff) [09:41:16] (03PS3) 10Brouberol: Unconditionally install kafka-kit on kafka brokers are they all run on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/969067 [09:41:48] RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:28] (03PS2) 10Muehlenhoff: Add Cumin alias for zookeeper/test [puppet] - 10https://gerrit.wikimedia.org/r/968953 [09:44:50] (03CR) 10Muehlenhoff: [C: 03+2] bacula::director: Avoid Ferm-specific syntax [puppet] - 
10https://gerrit.wikimedia.org/r/963745 (owner: 10Muehlenhoff) [09:45:57] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/968953 (owner: 10Muehlenhoff) [09:46:28] (03PS3) 10Ilias Sarantopoulos: ml-services: remove unused deployments from prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/969068 [09:47:19] (03PS4) 10Ilias Sarantopoulos: ml-services: remove unused deployments from prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/969068 [09:47:50] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:47:55] 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) [09:48:11] (03CR) 10Vgutierrez: [C: 04-1] "ACLs are missing in this CR so no request would get to the dedicated backend" [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [09:49:03] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10User-jbond: global http_proxy setting - https://phabricator.wikimedia.org/T278315 (10ayounsi) 05Open→03Resolved a:03ayounsi Updated the doc https://wikitech.wikimedia.org/wiki/HTTP_proxy I think everything here is done. Pleas... 
[09:49:49] (03PS1) 10Filippo Giunchedi: opentelemetry-collector: move deployment to ClusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/969071 (https://phabricator.wikimedia.org/T345637) [09:50:05] (NodeTextfileStale) resolved: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:50:24] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for zookeeper/test [puppet] - 10https://gerrit.wikimedia.org/r/968953 (owner: 10Muehlenhoff) [09:50:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969069 (owner: 10Muehlenhoff) [09:50:53] (03PS21) 10Jbond: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [09:52:26] (03CR) 10Jbond: [C: 03+1] "lgtm just some minor things. 
i also added some simple spec tests" [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [09:53:03] (NodeTextfileStale) resolved: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:53:10] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:55:54] RECOVERY - Query Service HTTP Port on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:57:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:58:17] (NELHigh) firing: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [09:58:30] woot [09:58:35] uh [09:58:45] yo [09:58:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar) [09:59:17] dns I think [09:59:25] ah no [09:59:37] address_unreachable [09:59:43] yeah, from russia [09:59:53] it has been flapping quite a bit [10:00:04] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T1000). [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T1000) [10:01:23] (03Merged) 10jenkins-bot: Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar) [10:03:17] (NELHigh) resolved: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [10:06:50] (03CR) 10Jbond: prometheus: Add a default rsyslog destination for all sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [10:07:21] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/969067 (owner: 10Brouberol) [10:09:24] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/966828 (owner: 10PipelineBot) [10:10:14] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/966828 (owner: 10PipelineBot) [10:10:55] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [10:10:59] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [10:12:22] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Provide current $PATH to the verify script [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/692995 (owner: 10Ppchelko) [10:13:17] (03PS2) 10Jbond: systemd::service: Add service owner parameter [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) [10:13:41] (03CR) 10Filippo Giunchedi: prometheus: Add a default rsyslog destination for all sites (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [10:14:57] (03Merged) 10jenkins-bot: Provide current $PATH to the verify script [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/692995 (owner: 10Ppchelko) [10:17:42] (03CR) 10Muehlenhoff: [C: 03+2] backup::host: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/965656 (owner: 10Muehlenhoff) [10:18:47] puppet-merge is failing [10:19:00] there's a syntax error in /etc/profile.d/proxy.sh [10:19:11] "}" unexpected [10:20:06] ^ XioNoX: looks like to be caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/968716/ ? [10:20:16] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [10:20:34] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [10:21:29] (03PS3) 10Jbond: systemd::service: Add service owner parameter [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) [10:22:19] moritzm: how do you reproduce? 
[10:22:49] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [10:24:12] (03CR) 10CI reject: [V: 04-1] systemd::service: Add service owner parameter [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [10:24:22] I puppet-merged a patch of mine and only got the following : https://paste.debian.net/1296276/ [10:25:33] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [10:25:59] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [10:26:34] (03PS4) 10Jbond: systemd::service: Add service owner parameter [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) [10:26:42] I believe the function syntax is a bash feature, but profile.d is also sourced for plain /bin/sh [10:27:01] (03CR) 10CI reject: [V: 04-1] systemd::service: Add service owner parameter [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [10:27:35] (03PS1) 10Ayounsi: Revert "Add helper functions to setup proxy env var" [puppet] - 10https://gerrit.wikimedia.org/r/968775 [10:27:55] taavi: interesting, and the script uses sh?! [10:28:03] sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/968775 for the revert [10:28:40] wait, will that revert, actually delete the file? 
[10:29:15] I think for the revert we'll need to stop puppet on puppetmaster* puppetserver* and remove it via Cumin [10:29:28] otherwise the puppet-merge for 968775 will also fail [10:29:58] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [10:30:08] I had a look at /etc/profile.d/bash_completion.sh and it in fact does additional checks to make sure it only runs under bash [10:30:22] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [10:30:42] by checking whether $BASH_VERSION is set, so we'll need a similar check for when we re-apply the proxy functions script [10:30:45] (03PS6) 10Fabfur: haproxy: remove multiple backends choice [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) [10:30:47] (03PS18) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [10:30:59] (03CR) 10Muehlenhoff: [C: 03+1] Revert "Add helper functions to setup proxy env var" [puppet] - 10https://gerrit.wikimedia.org/r/968775 (owner: 10Ayounsi) [10:40:47] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [10:41:10] (03PS5) 10Jbond: systemd::service: Add service owner parameter [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) [10:41:55] (03CR) 10Elukey: [C: 03+1] ml-services: remove unused deployments from prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/969068 (owner: 10Ilias Sarantopoulos) [10:42:19] moritzm: interesting, maybe rolling forward to add that check would be better? [10:42:54] (03PS19) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [10:43:21] XioNoX: you can also just make it POSIX compatible [10:43:33] jbond: ah? 
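A rough sketch of the two options raised in the thread above for a script under /etc/profile.d, which may be sourced by plain /bin/sh as well as by bash: write the functions in POSIX syntax, or keep bash-only code behind a $BASH_VERSION check. The function names set_proxy and clear_proxy, the proxy URL, and the path of the bash-only file are hypothetical placeholders, not the contents of the real proxy.sh or of either patch.

```shell
# Hypothetical profile.d-style snippet; names, URL and paths are illustrative only.

# Option 1: POSIX function syntax ("name() { ...; }", no "function" keyword),
# with names limited to letters, digits and underscores, so /bin/sh can parse it.
set_proxy() {
    # Placeholder URL, not a real proxy host.
    http_proxy="http://proxy.example.invalid:8080"
    https_proxy="$http_proxy"
    export http_proxy https_proxy
}

clear_proxy() {
    unset http_proxy https_proxy
}

# Option 2: only load bash-specific code when running under bash.
# The bash-only definitions live in a separate file so that /bin/sh
# never has to parse them; it is sourced only if BASH_VERSION is set.
if [ -n "${BASH_VERSION:-}" ]; then
    bash_only_file="/usr/local/share/proxy-functions.bash"  # hypothetical path
    if [ -r "$bash_only_file" ]; then
        . "$bash_only_file"
    fi
fi
```

Note that the guard in option 2 only helps if the bash-only syntax sits in a separately sourced file: a shell parses a whole `if` block before running it, so bash-only syntax placed directly inside the block could still trigger a syntax error under /bin/sh.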
[10:43:39] one sec [10:45:47] (03PS1) 10Jbond: profile: update script so its compatible with POSIX sh [puppet] - 10https://gerrit.wikimedia.org/r/969074 [10:45:49] XioNoX: see ^^ [10:46:36] (03CR) 10Filippo Giunchedi: "Nice, I like the generalization to unit" [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [10:46:37] jbond: that works the same way? [10:46:45] XioNoX: yes [10:48:06] (03PS1) 10Jbond: environment: fix SC3033 [puppet] - 10https://gerrit.wikimedia.org/r/969075 [10:48:07] jbond: if I start "sh" and paste that it also fails [10:48:19] not sure it's the best way to check though [10:48:23] XioNoX: possibly due to this https://gerrit.wikimedia.org/r/c/operations/puppet/+/969075 [10:48:29] i was just testing myself [10:49:03] yeah, that works [10:49:04] (03CR) 10Fabfur: haproxy: enable healthcheck-dedicated backend (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [10:49:24] (03CR) 10Ayounsi: [C: 03+1] environment: fix SC3033 [puppet] - 10https://gerrit.wikimedia.org/r/969075 (owner: 10Jbond) [10:49:28] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: remove unused deployments from prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/969068 (owner: 10Ilias Sarantopoulos) [10:49:31] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/201/con" [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [10:49:44] (03PS2) 10Jbond: profile: update script so its compatible with POSIX sh [puppet] - 10https://gerrit.wikimedia.org/r/969074 [10:49:47] (03CR) 10Ayounsi: [C: 03+1] "+1 with the next CR in the chain" [puppet] - 10https://gerrit.wikimedia.org/r/969074 (owner: 10Jbond) [10:49:48] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 
10serviceops, and 2 others: refreshUserImpactJob logs mysterious fatal errors - https://phabricator.wikimedia.org/T344428 (10Urbanecm_WMF) [10:49:57] XioNoX: so I have updated it; we can either go with the patch which uses _ instead of a hyphen [10:50:09] or we can add the logic to detect if we are in bash and stick with a - [10:50:16] (03Merged) 10jenkins-bot: ml-services: remove unused deployments from prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/969068 (owner: 10Ilias Sarantopoulos) [10:50:27] jbond: I think better to keep it simple and consistent [10:50:33] (03CR) 10Jbond: [C: 03+2] profile: update script so its compatible with POSIX sh [puppet] - 10https://gerrit.wikimedia.org/r/969074 (owner: 10Jbond) [10:50:41] so use _ [10:50:44] ack, I'll go with what's there [10:50:53] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10serviceops, and 2 others: refreshUserImpactJob logs mysterious fatal errors - https://phabricator.wikimedia.org/T344428 (10Urbanecm_WMF) [10:51:28] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:51:30] (03Abandoned) 10Ayounsi: Revert "Add helper functions to setup proxy env var" [puppet] - 10https://gerrit.wikimedia.org/r/968775 (owner: 10Ayounsi) [10:51:42] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:51:53] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
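[annotation] The choice discussed above (underscore function names versus a bash-only check) can be sketched roughly as follows. This is an illustration only, not the script that was merged: the helper names set_proxy, clear_proxy and running_under_bash, and the proxy URL, are hypothetical. The sketch assumes the portability problem was a function name containing a hyphen, which is what ShellCheck's SC3033 warns about for POSIX sh, and it uses the plain name() { ...; } definition form so that the file can be sourced from /etc/profile.d by both bash and /bin/sh.

```shell
# Hypothetical proxy endpoint; the real value is not shown in the log.
PROXY_URL="http://proxy.example.invalid:8080"

# Portable definitions: names limited to [A-Za-z0-9_] and no `function`
# keyword, so POSIX sh implementations such as dash accept them as well as bash.
set_proxy() {
    export http_proxy="$PROXY_URL"
    export https_proxy="$PROXY_URL"
}

clear_proxy() {
    unset http_proxy https_proxy
}

# The alternative mentioned in the log: detect bash by checking whether
# $BASH_VERSION is set, and only then load bash-specific code.
running_under_bash() {
    [ -n "${BASH_VERSION:-}" ]
}
```

A guard such as running_under_bash only helps if the bash-only definitions live in a separate file that is sourced conditionally, because a strict POSIX shell may reject a hyphenated function name while parsing the file, before the guard is ever evaluated. Using underscores avoids the need for a guard entirely, which fits the "keep it simple and consistent" decision above.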
[10:52:07] (03PS2) 10EoghanGaffney: [systemd/timer] Add optional SuccessExitStatus argument to timer services [puppet] - 10https://gerrit.wikimedia.org/r/968360 (https://phabricator.wikimedia.org/T349166) [10:52:09] (03PS2) 10EoghanGaffney: [quickdatacopy] Add success_exit_status option to rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/968361 (https://phabricator.wikimedia.org/T349166) [10:54:05] (03CR) 10Jbond: prometheus: Add a default rsyslog destination for all sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [10:56:04] (03PS1) 10Jbond: foce merge [puppet] - 10https://gerrit.wikimedia.org/r/969076 [10:56:29] (03CR) 10Jbond: [C: 03+2] foce merge [puppet] - 10https://gerrit.wikimedia.org/r/969076 (owner: 10Jbond) [10:56:35] (03CR) 10Jbond: [V: 03+2 C: 03+2] foce merge [puppet] - 10https://gerrit.wikimedia.org/r/969076 (owner: 10Jbond) [10:57:03] (03CR) 10CI reject: [V: 04-1] [quickdatacopy] Add success_exit_status option to rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/968361 (https://phabricator.wikimedia.org/T349166) (owner: 10EoghanGaffney) [10:57:30] XioNoX: moritzm: taavi: should all be fixed now [10:58:03] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [10:58:49] jbond: great, thanks! 
[10:58:54] jbond: confirmed that it works fine bash and sh [10:59:32] awesome [11:02:25] (03CR) 10Hnowlan: Revert "restbase: disable per-host icinga checks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967202 (owner: 10Hnowlan) [11:02:27] (03PS6) 10Jbond: systemd::service: Add service owner parameter [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) [11:02:29] (03CR) 10Jbond: "updated" [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [11:02:42] (03Abandoned) 10Hnowlan: Revert "restbase: disable per-host icinga checks" [puppet] - 10https://gerrit.wikimedia.org/r/967202 (owner: 10Hnowlan) [11:03:17] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [11:04:03] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [11:04:15] 10SRE, 10GrowthExperiments-Homepage, 10GrowthExperiments-ImpactModule, 10serviceops, and 2 others: refreshUserImpactJob logs mysterious fatal errors - https://phabricator.wikimedia.org/T344428 (10Urbanecm_WMF) 05Open→03Resolved Let's be optimistic and call this resolved, since the errors disappeared. I... [11:04:51] (03CR) 10Jbond: "to answer a questions from irc. this CR would not have removed the files as /etc/profile.d is not fully managed i.e. 
there are files in t" [puppet] - 10https://gerrit.wikimedia.org/r/968775 (owner: 10Ayounsi) [11:05:57] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/968360 (https://phabricator.wikimedia.org/T349166) (owner: 10EoghanGaffney) [11:06:25] (03CR) 10Muehlenhoff: systemd::service: Add service owner parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [11:07:18] (03PS3) 10EoghanGaffney: [quickdatacopy] Add success_exit_status option to rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/968361 (https://phabricator.wikimedia.org/T349166) [11:07:59] (PuppetFailure) firing: Puppet has failed on puppetserver1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:09:05] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968361 (https://phabricator.wikimedia.org/T349166) (owner: 10EoghanGaffney) [11:10:57] jbond: thx for saving the day! 
rolling back would have been much more of a pain [11:15:13] (03PS7) 10Jbond: systemd::service: Add service owner parameter [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) [11:15:15] (03PS1) 10Jbond: contacts: add new type for WMF sre teams [puppet] - 10https://gerrit.wikimedia.org/r/969082 [11:16:44] (03CR) 10Jbond: "thanks see inline" [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [11:18:06] (03CR) 10Jbond: [C: 03+1] "lgtm just a minor doc nit" [puppet] - 10https://gerrit.wikimedia.org/r/968361 (https://phabricator.wikimedia.org/T349166) (owner: 10EoghanGaffney) [11:18:34] (03CR) 10CI reject: [V: 04-1] systemd::service: Add service owner parameter [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [11:19:10] (03CR) 10CI reject: [V: 04-1] contacts: add new type for WMF sre teams [puppet] - 10https://gerrit.wikimedia.org/r/969082 (owner: 10Jbond) [11:19:54] (03CR) 10Vgutierrez: [C: 04-1] haproxy: enable healthcheck-dedicated backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [11:26:26] (03PS1) 10KartikMistry: CX3 Build 0.2.0+20231026 [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) [11:26:54] (03PS1) 10KartikMistry: CX3 Build 0.2.0+20231026 [extensions/ContentTranslation] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968777 (https://phabricator.wikimedia.org/T348563) [11:32:59] (PuppetFailure) resolved: Puppet has failed on puppetserver1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:34:01] (03PS1) 10Jbond: Revert "puppetserver: Add prometheus to reporters" [puppet] - 
10https://gerrit.wikimedia.org/r/968779 [11:34:04] (03PS1) 10Jbond: Revert "P:puppetserver: Add profile to create puppet-prometheus_..." [puppet] - 10https://gerrit.wikimedia.org/r/968780 [11:34:08] (03PS1) 10Jbond: Revert "prometheus_reporter: Add new reporter for providing prom..." [puppet] - 10https://gerrit.wikimedia.org/r/968781 [11:34:27] (03CR) 10CI reject: [V: 04-1] Revert "P:puppetserver: Add profile to create puppet-prometheus_..." [puppet] - 10https://gerrit.wikimedia.org/r/968780 (owner: 10Jbond) [11:34:29] (03CR) 10CI reject: [V: 04-1] Revert "prometheus_reporter: Add new reporter for providing prom..." [puppet] - 10https://gerrit.wikimedia.org/r/968781 (owner: 10Jbond) [11:39:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:39:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:40:33] (03CR) 10Muehlenhoff: [C: 03+2] profile::piwik::database: Enforce type for port [puppet] - 10https://gerrit.wikimedia.org/r/955927 (owner: 10Muehlenhoff) [11:41:00] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:41:08] (03PS2) 10Jbond: Revert "puppetserver: Add prometheus to reporters" [puppet] - 10https://gerrit.wikimedia.org/r/968779 [11:41:10] (03PS2) 10Jbond: Revert "P:puppetserver: Add profile to create puppet-prometheus_..." [puppet] - 10https://gerrit.wikimedia.org/r/968780 [11:41:12] (03PS2) 10Jbond: Revert "prometheus_reporter: Add new reporter for providing prom..." 
[puppet] - 10https://gerrit.wikimedia.org/r/968781 [11:41:33] (03CR) 10Aqu: "Adding Filippo to reviewers" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [11:41:58] (03CR) 10Jbond: [C: 03+2] Revert "puppetserver: Add prometheus to reporters" [puppet] - 10https://gerrit.wikimedia.org/r/968779 (owner: 10Jbond) [11:42:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:42:19] (03CR) 10CI reject: [V: 04-1] CX3 Build 0.2.0+20231026 [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [11:43:26] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete profile::nftables::basefirewall [puppet] - 10https://gerrit.wikimedia.org/r/965651 (owner: 10Muehlenhoff) [11:43:40] (03CR) 10Jbond: [C: 03+2] Revert "prometheus_reporter: Add new reporter for providing prom..." [puppet] - 10https://gerrit.wikimedia.org/r/968781 (owner: 10Jbond) [11:43:42] (03CR) 10Jbond: [C: 03+2] Revert "P:puppetserver: Add profile to create puppet-prometheus_..." 
[puppet] - 10https://gerrit.wikimedia.org/r/968780 (owner: 10Jbond) [11:43:42] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:44:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:44:32] (03CR) 10KartikMistry: "recheck" [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [11:48:06] (03PS4) 10EoghanGaffney: [quickdatacopy] Add success_exit_status option to rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/968361 (https://phabricator.wikimedia.org/T349166) [11:48:23] (03CR) 10Vgutierrez: [C: 04-1] haproxy: enable healthcheck-dedicated backend (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [11:50:21] (03CR) 10Muehlenhoff: profile::tlsproxy::envoy: Add support for passing nft firewall definitions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [11:50:35] (03CR) 10EoghanGaffney: [quickdatacopy] Add success_exit_status option to rsync::quickdatacopy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968361 (https://phabricator.wikimedia.org/T349166) (owner: 10EoghanGaffney) [11:53:47] (03CR) 10Jbond: [C: 03+1] [quickdatacopy] Add success_exit_status option to rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/968361 (https://phabricator.wikimedia.org/T349166) (owner: 10EoghanGaffney) [11:54:04] (03PS2) 10Jbond: contacts: add new type for WMF sre teams [puppet] - 10https://gerrit.wikimedia.org/r/969082 
[11:54:06] (03PS8) 10Jbond: systemd::service: Add service owner parameter [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) [11:55:28] (03CR) 10EoghanGaffney: [C: 03+2] [systemd/timer] Add optional SuccessExitStatus argument to timer services [puppet] - 10https://gerrit.wikimedia.org/r/968360 (https://phabricator.wikimedia.org/T349166) (owner: 10EoghanGaffney) [11:55:38] (03CR) 10JMeybohm: "LGTM, I've the nodePort has been reserved in https://wikitech.wikimedia.org/w/index.php?title=Kubernetes/Service_ports you might want to r" [deployment-charts] - 10https://gerrit.wikimedia.org/r/969071 (https://phabricator.wikimedia.org/T345637) (owner: 10Filippo Giunchedi) [11:55:45] (03CR) 10JMeybohm: [C: 03+1] opentelemetry-collector: move deployment to ClusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/969071 (https://phabricator.wikimedia.org/T345637) (owner: 10Filippo Giunchedi) [11:55:56] (03CR) 10EoghanGaffney: [C: 03+2] [quickdatacopy] Add success_exit_status option to rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/968361 (https://phabricator.wikimedia.org/T349166) (owner: 10EoghanGaffney) [11:58:49] (03PS3) 10Jbond: contacts: add new type for WMF sre teams [puppet] - 10https://gerrit.wikimedia.org/r/969082 [11:58:51] (03PS9) 10Jbond: systemd::service: Add service owner parameter [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T1200) [12:00:56] (03CR) 10Brouberol: [C: 03+2] Unconditionally install kafka-kit on kafka brokers are they all run on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/969067 (owner: 10Brouberol) [12:01:53] (03CR) 10KartikMistry: "recheck" [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) 
[12:04:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/969082 (owner: 10Jbond) [12:05:40] (03CR) 10Muehlenhoff: systemd::service: Add service owner parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [12:07:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:07:26] (03PS20) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [12:20:24] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:21:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:21:44] (03CR) 10CI reject: [V: 04-1] CX3 Build 0.2.0+20231026 [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [12:21:54] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:23:04] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/203/con" [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [12:24:29] (03PS21) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [12:26:53] !log stevemunene@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance [12:27:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:27:18] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance [12:27:32] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.541 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:27:34] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:29:43] (03CR) 10Nik Gkountas: "recheck" [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [12:32:00] (03CR) 10Muehlenhoff: [C: 03+2] profile::tlsproxy::envoy: Add support for passing nft firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/965092 (owner: 10Muehlenhoff) [12:37:23] (03CR) 10Brouberol: "At this stage, we only get our PKI to generate a certificate, but we don't override `skein.crt` just yet. I would like to be able to perfo" [puppet] - 10https://gerrit.wikimedia.org/r/968612 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [12:37:40] (03CR) 10Peter Fischer: [C: 03+1] "Good idea, better than breaking the pipeline due to unexpected message payloads." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/969064 (owner: 10DCausse) [12:45:56] (03CR) 10DCausse: [C: 03+1] dse-k8s: don't watch rdf-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/966921 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [12:49:25] (03CR) 10KartikMistry: "recheck" [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [12:50:15] (03PS1) 10Muehlenhoff: idp_test: Configure an empty firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/969109 [12:50:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969109 (owner: 10Muehlenhoff) [12:52:11] (03CR) 10Fabfur: haproxy: enable healthcheck-dedicated backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [12:53:52] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/204/con" [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [12:59:54] (03CR) 10Brouberol: "Hi! Seems that I had the exact same idea here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/969067" [puppet] - 10https://gerrit.wikimedia.org/r/969069 (owner: 10Muehlenhoff) [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T1300). [13:00:06] Kizule, dcausse, and kart_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:14] \o [13:00:56] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:06] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:50] * kart_ is here [13:02:30] While Kizule and dcausse's patches are being deployed, should I +2 my wmf backport? [13:02:45] It generally takes 15-20 minutes to merge. [13:04:05] o/ [13:04:26] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [13:05:11] o/ I can deploy! [13:05:18] * Lucas_WMDE looks at the calendar [13:06:15] thanks! [13:06:23] kart_: yeah I think that’s okay [13:06:50] My patch for AbuseFilter is good to go, as there weren't any negative comments. I'll test it on mwdebug, once it's needed. Throttle rule one is good to go as well without testing on mwdebug. [13:06:56] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:07:25] Kizule: the link in the task description just goes to the top of the page for me, was it already archived? 
[13:07:40] (03CR) 10Muehlenhoff: Remove OS conditional for kafka-kit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969069 (owner: 10Muehlenhoff) [13:07:48] (03Abandoned) 10Muehlenhoff: Remove OS conditional for kafka-kit [puppet] - 10https://gerrit.wikimedia.org/r/969069 (owner: 10Muehlenhoff) [13:07:55] looks like it ended up at https://sr.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%98%D0%B0:%D0%A2%D1%80%D0%B3/%D0%A2%D0%B5%D1%85%D0%BD%D0%B8%D0%BA%D0%B0#%D0%94%D0%BE%D0%B4%D0%B0%D0%B2%D0%B0%D1%9A%D0%B5_%D0%BC%D0%BE%D0%B3%D1%83%D1%9B%D0%BD%D0%BE%D1%81%D1%82%D0%B8_%D0%B1%D0%BB%D0%BE%D0%BA%D0%B8%D1%80%D1%9A%D0%B0%D1%9A%D0%B0_ [13:07:55] %D0%A4%D0%B8%D0%BB%D1%82%D0%B5%D1%80%D1%83_%D0%BF%D1%80%D0%BE%D1%82%D0%B8%D0%B2_%D0%B7%D0%BB%D0%BE%D1%83%D0%BF%D0%BE%D1%82%D1%80%D0%B5%D0%B1%D0%B5 ? [13:07:57] blegh [13:08:10] but anyway that looks like a lot of green votes to me, so yay [13:08:10] Ah, Serbian Cyrillic and browsers... [13:08:18] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10Language-Team (Language-2023-October-December): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (10KartikMistry) @MatthewVernon What should we do next on this task? Is anything required from the Language team? [13:08:23] Try again: ttps://sr.wikipedia.org/wiki/Википедија:Трг/Техника#Додавање_могућности_блокирњања_Филтеру_против_злоупотребе [13:08:26] https://sr.wikipedia.org/wiki/Википедија:Трг/Техника#Додавање_могућности_блокирњања_Филтеру_против_злоупотребе [13:08:49] (03PS2) 10Lucas Werkmeister (WMDE): Enable block feature for AbuseFilter on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968713 (https://phabricator.wikimedia.org/T349727) (owner: 10Zoranzoki21) [13:08:55] Sorry, I always remove letter H before copy-pasting URLs, in case they have non-latin characters, so these don't get converted to %D0 and blahblah.. [13:09:05] heh, clever [13:09:08] Lucas_WMDE: I'm doing +2 on wmf.1 patch. 
[13:09:15] Kizule: indeed, that link sends me to the right heading [13:09:18] kart_: ack [13:09:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968713 (https://phabricator.wikimedia.org/T349727) (owner: 10Zoranzoki21) [13:10:11] (03Merged) 10jenkins-bot: Enable block feature for AbuseFilter on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968713 (https://phabricator.wikimedia.org/T349727) (owner: 10Zoranzoki21) [13:10:34] (03CR) 10CI reject: [V: 04-1] CX3 Build 0.2.0+20231026 [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [13:10:35] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:968713|Enable block feature for AbuseFilter on srwiki (T349727)]] [13:10:48] T349727: Enable block feature for AbuseFilter on srwiki - https://phabricator.wikimedia.org/T349727 [13:10:53] * Lucas_WMDE tries to find out if wmgThrottlingExceptions supports single IPs without any /range [13:11:24] seems like it, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/963025 did the same [13:11:47] (03CR) 10Abijeet Patro: "recheck" [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [13:11:56] !log lucaswerkmeister-wmde@deploy2002 zoranzoki21 and lucaswerkmeister-wmde: Backport for [[gerrit:968713|Enable block feature for AbuseFilter on srwiki (T349727)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:12:38] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM, using a single IP without /range seems to be supported (seen in e.g. 
I9619d197c2)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968737 (https://phabricator.wikimedia.org/T349234) (owner: 10Zoranzoki21) [13:12:53] Kizule: can you test the srwiki change? [13:13:07] (not sure if you have IRC configured to be pinged as zoranzoki21 or not ^^) [13:13:27] Lucas_WMDE: I have browser on monitor, so I can see pings. [13:13:34] (03CR) 10KartikMistry: [C: 03+2] CX3 Build 0.2.0+20231026 [extensions/ContentTranslation] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968777 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [13:13:56] On which mwdebug2* I should test? 2001 or 2002? [13:14:31] any of them [13:14:33] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [13:14:33] Kizule: anyone should be OK [13:14:43] scap backport sends the changes to all of them [13:14:57] (including k8s-experimental) [13:15:08] the days of me manually running scap pull on just one mwdebug host are mostly over ^^ [13:15:22] !log installing poppler security updates [13:15:24] Lucas_WMDE: I can see option for enabling/disabling blocks on Serbian Wikipedia's Special:AbuseFilter, so this is good to go. [13:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:31] (the IRC ping for “synced to the testservers” used to list all of them but it was considered too verbose) [13:15:33] ok! [13:15:35] !log lucaswerkmeister-wmde@deploy2002 zoranzoki21 and lucaswerkmeister-wmde: Continuing with sync [13:17:58] Lucas_WMDE: Amazing! 
[13:18:18] (about days over) [13:18:24] heh, it is quite convenient :) [13:18:59] (PuppetZeroResources) firing: Puppet has failed generate resources on grafana2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:19:59] (PuppetZeroResources) firing: Puppet has failed generate resources on deploy2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:20:06] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [13:20:15] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [13:20:58] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:968713|Enable block feature for AbuseFilter on srwiki (T349727)]] (duration: 10m 23s) [13:21:03] T349727: Enable block feature for AbuseFilter on srwiki - https://phabricator.wikimedia.org/T349727 [13:21:07] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw [13:21:31] * Lucas_WMDE peeks at zuul [13:21:44] 11 more minutes for the cx backport, I think we can do another config change in that time [13:21:54] (03PS4) 10Lucas Werkmeister (WMDE): Add throttle rule for Edit-a-Thon on 2023-11-03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968737 (https://phabricator.wikimedia.org/T349234) (owner: 10Zoranzoki21) [13:22:07] Lucas_WMDE: My AbuseFilter patch looks good outside of mwdebug world. [13:22:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968737 (https://phabricator.wikimedia.org/T349234) (owner: 10Zoranzoki21) [13:22:23] Kizule: great! 
[13:22:59] (PuppetZeroResources) firing: Puppet has failed generate resources on centrallog1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:23:07] (03Merged) 10jenkins-bot: Add throttle rule for Edit-a-Thon on 2023-11-03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968737 (https://phabricator.wikimedia.org/T349234) (owner: 10Zoranzoki21) [13:23:31] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:968737|Add throttle rule for Edit-a-Thon on 2023-11-03 (T349234)]] [13:23:41] T349234: Temporary Limit Removal Request - Edit-a-Thon on 2023-11-03 - https://phabricator.wikimedia.org/T349234 [13:24:53] !log lucaswerkmeister-wmde@deploy2002 zoranzoki21 and lucaswerkmeister-wmde: Backport for [[gerrit:968737|Add throttle rule for Edit-a-Thon on 2023-11-03 (T349234)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:24:59] nothing to test for this one [13:24:59] (PuppetZeroResources) firing: Puppet has failed generate resources on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:25:00] !log lucaswerkmeister-wmde@deploy2002 zoranzoki21 and lucaswerkmeister-wmde: Continuing with sync [13:26:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [13:26:48] I have a patch coming momentarily for puppet. 
[13:27:06] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad [13:27:10] (03Merged) 10jenkins-bot: CX3 Build 0.2.0+20231026 [extensions/ContentTranslation] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968777 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [13:27:16] (03PS1) 10Abijeet Patro: Remove broken QUnit test [extensions/UniversalLanguageSelector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968783 (https://phabricator.wikimedia.org/T349485) [13:27:18] eoghan: should I pause deploying or something? [13:27:23] (03PS1) 10EoghanGaffney: [quickdatacopy] Fix default selector for ignore_missing_file_errors [puppet] - 10https://gerrit.wikimedia.org/r/969118 [13:27:29] (IIUC puppet changes usually seem to be deployed in parallel without issue) [13:27:31] (03PS1) 10KartikMistry: Remove broken QUnit test [extensions/UniversalLanguageSelector] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968784 (https://phabricator.wikimedia.org/T349485) [13:27:37] Lucas_WMDE: I don't think we need that right now [13:27:43] ok :) [13:27:54] (03PS2) 10EoghanGaffney: [quickdatacopy] Fix default selector for ignore_missing_file_errors [puppet] - 10https://gerrit.wikimedia.org/r/969118 [13:27:56] (03CR) 10Jbond: [C: 03+2] contacts: add new type for WMF sre teams [puppet] - 10https://gerrit.wikimedia.org/r/969082 (owner: 10Jbond) [13:28:12] (03CR) 10Jbond: [C: 03+2] systemd::service: Add service owner parameter [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [13:28:27] (03CR) 10CI reject: [V: 04-1] CX3 Build 0.2.0+20231026 [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [13:29:36] (03Abandoned) 10KartikMistry: Remove broken QUnit test [extensions/UniversalLanguageSelector] (wmf/1.42.0-wmf.1) - 
10https://gerrit.wikimedia.org/r/968784 (https://phabricator.wikimedia.org/T349485) (owner: 10KartikMistry) [13:30:14] (03PS1) 10Jbond: rsync::quickdatacopy: check for empty [puppet] - 10https://gerrit.wikimedia.org/r/969119 [13:30:15] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:968737|Add throttle rule for Edit-a-Thon on 2023-11-03 (T349234)]] (duration: 06m 43s) [13:30:22] T349234: Temporary Limit Removal Request - Edit-a-Thon on 2023-11-03 - https://phabricator.wikimedia.org/T349234 [13:30:34] alright, let’s sync the wmf.1 cx change [13:30:39] (cc kart_) [13:30:41] (03CR) 10CI reject: [V: 04-1] rsync::quickdatacopy: check for empty [puppet] - 10https://gerrit.wikimedia.org/r/969119 (owner: 10Jbond) [13:30:56] Lucas_WMDE: Yeah. Patch is merged. [13:31:10] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:968777|CX3 Build 0.2.0+20231026 (T348563 T308836)]] [13:31:15] (03CR) 10Nik Gkountas: "recheck" [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [13:31:23] T348563: CX: Modify query endpoint to fetch translations based on the UI scenario that needs them - https://phabricator.wikimedia.org/T348563 [13:31:24] T308836: Handle session expiration in Section Translation - https://phabricator.wikimedia.org/T308836 [13:31:47] (03PS2) 10Jbond: rsync::quickdatacopy: ignore_missing_file_errors should not be optional [puppet] - 10https://gerrit.wikimedia.org/r/969119 [13:31:49] hm, and the wmf.2 one is always failing with the same CI error? 
[13:31:52] !log installing curl security updates on buster [13:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:58] (03CR) 10Jbond: [C: 03+2] rsync::quickdatacopy: ignore_missing_file_errors should not be optional [puppet] - 10https://gerrit.wikimedia.org/r/969119 (owner: 10Jbond) [13:32:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [13:32:30] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and kartik: Backport for [[gerrit:968777|CX3 Build 0.2.0+20231026 (T348563 T308836)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:32:41] kart_: can you test the change? [13:33:11] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/206/console" [puppet] - 10https://gerrit.wikimedia.org/r/969118 (owner: 10EoghanGaffney) [13:33:16] Lucas_WMDE: sure. 
[13:33:19] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/205/console" [puppet] - 10https://gerrit.wikimedia.org/r/969118 (owner: 10EoghanGaffney) [13:33:25] thx [13:34:59] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:36:47] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/969109 (owner: 10Muehlenhoff) [13:39:30] (03CR) 10Muehlenhoff: [C: 03+2] idp_test: Configure an empty firewall_srange [puppet] - 10https://gerrit.wikimedia.org/r/969109 (owner: 10Muehlenhoff) [13:39:59] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:39:59] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on deploy2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:40:00] Lucas_WMDE: let's go ahead. It seems working. [13:40:32] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and kartik: Continuing with sync [13:40:34] thanks! [13:40:59] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm, this should catch the "no matching entry for selector parameter with value '' error"." 
[puppet] - 10https://gerrit.wikimedia.org/r/969118 (owner: 10EoghanGaffney) [13:41:14] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] [quickdatacopy] Fix default selector for ignore_missing_file_errors [puppet] - 10https://gerrit.wikimedia.org/r/969118 (owner: 10EoghanGaffney) [13:41:23] (03CR) 10Jelto: [quickdatacopy] Fix default selector for ignore_missing_file_errors [puppet] - 10https://gerrit.wikimedia.org/r/969118 (owner: 10EoghanGaffney) [13:41:50] Lucas_WMDE: do we have enough time to include one more patch that will fix CI for my next patch? [13:42:01] jouncebot: next [13:42:01] In 2 hour(s) and 17 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T1600) [13:42:06] (03CR) 10Jbond: "i had allready fixed this will revert" [puppet] - 10https://gerrit.wikimedia.org/r/969118 (owner: 10EoghanGaffney) [13:42:09] yeah I think we can overrun a bit [13:42:18] unless somebody tells me not to [13:42:28] Lucas_WMDE: I can also do deploy if you're busy after the window is over. [13:42:30] jbond: Ah, sorry, didn't realise you had a fix. Thanks for catching. [13:42:34] Right. 
[13:42:36] dcausse: your config change is next :) [13:42:41] :) [13:42:42] (03PS1) 10Jbond: Revert "[quickdatacopy] Fix default selector for ignore_missing_file_errors" [puppet] - 10https://gerrit.wikimedia.org/r/968785 [13:42:42] kart_: I think I’m fine, thanks [13:42:56] (03CR) 10Jbond: [C: 03+2] Revert "[quickdatacopy] Fix default selector for ignore_missing_file_errors" [puppet] - 10https://gerrit.wikimedia.org/r/968785 (owner: 10Jbond) [13:42:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on centrallog1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:43:04] (I narrowly realized just “you’re next” might sound threatening :D) [13:43:42] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "[quickdatacopy] Fix default selector for ignore_missing_file_errors" [puppet] - 10https://gerrit.wikimedia.org/r/968785 (owner: 10Jbond) [13:43:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on grafana2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:44:55] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [13:44:59] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:44:59] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on deploy2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:45:08] (03PS1) 10Jbond: 
docker::reporter: route k8s alerts to service ops [puppet] - 10https://gerrit.wikimedia.org/r/969121 (https://phabricator.wikimedia.org/T349176) [13:45:58] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:968777|CX3 Build 0.2.0+20231026 (T348563 T308836)]] (duration: 14m 48s) [13:46:15] T348563: CX: Modify query endpoint to fetch translations based on the UI scenario that needs them - https://phabricator.wikimedia.org/T348563 [13:46:15] T308836: Handle session expiration in Section Translation - https://phabricator.wikimedia.org/T308836 [13:46:42] !log installing cpio security updates [13:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:09] (03PS2) 10Lucas Werkmeister (WMDE): cirrus: disable canary events for update & error streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969064 (owner: 10DCausse) [13:47:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969064 (owner: 10DCausse) [13:47:42] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2140 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/968968 (https://phabricator.wikimedia.org/T349820) [13:47:46] (03PS1) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/968969 (https://phabricator.wikimedia.org/T349820) [13:48:08] (03PS1) 10Jcrespo: mariadb: Remove sustained lag monitoring from misc dbs [puppet] - 10https://gerrit.wikimedia.org/r/969122 [13:48:12] (03PS1) 10Jbond: firewall: allow empty sranges [puppet] - 10https://gerrit.wikimedia.org/r/969123 [13:48:58] (03Merged) 10jenkins-bot: cirrus: disable canary events for update & error streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969064 (owner: 10DCausse) [13:49:22] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:969064|cirrus: disable canary events for update & error streams]] 
[13:49:55] (03CR) 10Jbond: [C: 03+2] firewall: allow empty sranges [puppet] - 10https://gerrit.wikimedia.org/r/969123 (owner: 10Jbond) [13:50:10] Lucas_WMDE: did my first change deploy fully? :) [13:50:35] (03CR) 10Jcrespo: "It was already not active on some hosts, such as analytics, so this should work just fin." [puppet] - 10https://gerrit.wikimedia.org/r/969122 (owner: 10Jcrespo) [13:50:44] !log lucaswerkmeister-wmde@deploy2002 dcausse and lucaswerkmeister-wmde: Backport for [[gerrit:969064|cirrus: disable canary events for update & error streams]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:51:20] Lucas_WMDE: all good on testservers [13:51:24] thx! [13:51:26] !log lucaswerkmeister-wmde@deploy2002 dcausse and lucaswerkmeister-wmde: Continuing with sync [13:52:50] (03CR) 10JMeybohm: [C: 03+2] Update calculator-service to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967403 (https://phabricator.wikimedia.org/T346638) (owner: 10JMeybohm) [13:53:37] (03Merged) 10jenkins-bot: Update calculator-service to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967403 (https://phabricator.wikimedia.org/T346638) (owner: 10JMeybohm) [13:53:48] (03PS2) 10Abijeet Patro: CX3 Build 0.2.0+20231026 [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [13:54:21] (03PS1) 10Jbond: tlsproxy::envoy: check for NotUndef instead of falsy [puppet] - 10https://gerrit.wikimedia.org/r/969124 [13:56:02] Lucas_WMDE: Should I +2 my next change meanwhile? 
[13:56:33] kart_: if CI is working out, sure, go ahead [13:56:35] (03CR) 10Jbond: docker::reporter: route k8s alerts to service ops (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/969121 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [13:56:41] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:969064|cirrus: disable canary events for update & error streams]] (duration: 07m 19s) [13:57:06] Lucas_WMDE: It is fix for CI for 3rd patch :) [13:57:20] (03CR) 10KartikMistry: [C: 03+2] Remove broken QUnit test [extensions/UniversalLanguageSelector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968783 (https://phabricator.wikimedia.org/T349485) (owner: 10Abijeet Patro) [13:57:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:57:46] Lucas_WMDE: thanks for the deploy! 
:) [13:57:58] np :) [13:58:16] (03CR) 10Jbond: [C: 03+2] tlsproxy::envoy: check for NotUndef instead of falsy [puppet] - 10https://gerrit.wikimedia.org/r/969124 (owner: 10Jbond) [13:59:06] bleh, `scap backport` for both changes fails [13:59:15] because the ULS change exists on several branches, I guess [13:59:19] (and was abandoned on one of them) [13:59:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968783 (https://phabricator.wikimedia.org/T349485) (owner: 10Abijeet Patro) [13:59:44] let’s run the ULS one independently then [14:00:24] ah [14:00:29] (03CR) 10Jcrespo: "This was brought up to us by our manager- the alert was too noisy- and while we could actually search if it was etherpad, or bacula, or li" [puppet] - 10https://gerrit.wikimedia.org/r/969122 (owner: 10Jcrespo) [14:01:28] ULS change for CI fix is on wmf.2 only [14:01:55] (03CR) 10Marostegui: [C: 03+1] mariadb: Remove sustained lag monitoring from misc dbs [puppet] - 10https://gerrit.wikimedia.org/r/969122 (owner: 10Jcrespo) [14:02:03] (03PS15) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [14:02:11] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [14:02:35] (03CR) 10Bking: rdf-streaming-updater: update staging values (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [14:03:04] 10SRE, 10ops-eqiad, 10Cassandra, 10decommission-hardware: decommission restbase1018 - https://phabricator.wikimedia.org/T349711 (10Jclark-ctr) [14:03:10] (03CR) 10JMeybohm: [C: 03+2] Update similar-users to use certmanager certs 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/967473 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:03:16] 10SRE, 10ops-eqiad, 10Cassandra, 10decommission-hardware: decommission restbase1018 - https://phabricator.wikimedia.org/T349711 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:03:45] 10SRE, 10ops-eqiad, 10Cassandra, 10decommission-hardware: decommission restbase1017 - https://phabricator.wikimedia.org/T349710 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:03:56] (03Merged) 10jenkins-bot: Update similar-users to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967473 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:04:17] 10SRE, 10ops-eqiad, 10Cassandra, 10decommission-hardware: decommission restbase1016 - https://phabricator.wikimedia.org/T349709 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:05:29] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "+2ing ahead of deployment, CI seems to be fixed with the Depends-On" [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [14:08:35] (03PS1) 10Jbond: firewall: Allow ferm::service to accept an empty list for ranges [puppet] - 10https://gerrit.wikimedia.org/r/969125 [14:08:37] (03PS1) 10Jbond: httpbb: set ServiceOps as the team owner for httpbb jobs [puppet] - 10https://gerrit.wikimedia.org/r/969126 [14:08:48] (03PS2) 10Jbond: httpbb: set ServiceOps as the team owner for httpbb jobs [puppet] - 10https://gerrit.wikimedia.org/r/969126 [14:09:01] (03PS16) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [14:09:02] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/similar-users: apply [14:09:11] (03CR) 10jenkins-bot: rdf-streaming-updater: update staging values [deployment-charts] - 
10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [14:09:27] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/similar-users: apply [14:10:02] hm, why is Zuul not running gate-and-submit for the cx change? [14:10:41] (03PS17) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [14:10:46] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/similar-users: apply [14:11:19] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/similar-users: apply [14:11:27] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [14:11:30] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/similar-users: apply [14:11:56] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/similar-users: apply [14:12:49] (03CR) 10Filippo Giunchedi: "LGTM overall, see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [14:13:44] (03CR) 10Arnaudb: [C: 03+2] mariadb: Remove sustained lag monitoring from misc dbs [puppet] - 10https://gerrit.wikimedia.org/r/969122 (owner: 10Jcrespo) [14:13:48] (03CR) 10Filippo Giunchedi: "Also I wanted to add: do you have a sample of statsd metrics that will be sent? With that we'll able to tweak/adjust profile::prometheus::" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [14:15:38] Lucas_WMDE: Not sure. Maybe dependent change need to deploy first? 
[14:15:48] (03Merged) 10jenkins-bot: Remove broken QUnit test [extensions/UniversalLanguageSelector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968783 (https://phabricator.wikimedia.org/T349485) (owner: 10Abijeet Patro) [14:16:04] usually I’d expect it to queue the other change up immediately [14:16:10] Lucas_WMDE: Yeah [14:16:13] (and then potentially not end up merging it if the dependent change doesn’t go through) [14:16:14] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:968783|Remove broken QUnit test (T349485)]] [14:16:16] but let’s try again now [14:16:18] T349485: ULS JavaScript test failure - https://phabricator.wikimedia.org/T349485 [14:16:28] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] CX3 Build 0.2.0+20231026 (031 comment) [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [14:16:34] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [14:17:23] doesn’t seem to work -.- [14:17:27] :/ [14:17:34] I guess we can just remove the Depends-On? [14:17:36] !log lucaswerkmeister-wmde@deploy2002 abi and lucaswerkmeister-wmde: Backport for [[gerrit:968783|Remove broken QUnit test (T349485)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:17:36] or “comment it out” [14:17:40] Disabled-Depends-On: or whatever [14:17:46] Depended-On: [14:17:47] idk [14:17:51] !log lucaswerkmeister-wmde@deploy2002 abi and lucaswerkmeister-wmde: Continuing with sync [14:18:31] (03PS3) 10KartikMistry: CX3 Build 0.2.0+20231026 [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) [14:18:54] It is on test-wmf again :/ [14:19:09] Lucas_WMDE: should we force +2 with it? 
[14:19:37] (03CR) 10Lucas Werkmeister (WMDE): "repeat +2 now that the Depends-On which seemed to confuse Zuul is gone (the change in question, I52b27610c7f753b56e44ed920f1228e716752a7c " [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [14:19:44] I gave it a normal +2, that should be enough I think [14:20:00] OK! [14:23:07] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:968783|Remove broken QUnit test (T349485)]] (duration: 06m 53s) [14:23:12] T349485: ULS JavaScript test failure - https://phabricator.wikimedia.org/T349485 [14:23:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/969125 (owner: 10Jbond) [14:24:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [14:24:27] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [14:25:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/969126 (owner: 10Jbond) [14:30:28] (03CR) 10Volans: "This was not submited by +2ed... did you meant to also submit it?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/967629 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [14:33:56] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bullseye [14:34:01] (03PS1) 10Muehlenhoff: piwik::dataase: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969128 [14:34:02] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye [14:34:10] (03CR) 10Filippo Giunchedi: [C: 03+2] opentelemetry-collector: move deployment to ClusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/969071 (https://phabricator.wikimedia.org/T345637) (owner: 10Filippo Giunchedi) [14:34:40] (03CR) 10CI reject: [V: 04-1] piwik::dataase: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969128 (owner: 10Muehlenhoff) [14:35:39] (03PS2) 10Muehlenhoff: piwik::dataase: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969128 [14:35:52] !log filippo@deploy2002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply [14:36:02] !log filippo@deploy2002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply [14:36:05] (03CR) 10CI reject: [V: 04-1] piwik::dataase: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969128 (owner: 10Muehlenhoff) [14:36:08] !log filippo@deploy2002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply [14:36:12] (03PS1) 10JMeybohm: Add mw-wikifunctions to MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/969131 (https://phabricator.wikimedia.org/T347544) [14:36:13] !log filippo@deploy2002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply [14:36:18] !log 
filippo@deploy2002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply [14:36:27] !log filippo@deploy2002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply [14:37:17] zuul is almost there… almost there… [14:37:50] (03Merged) 10jenkins-bot: CX3 Build 0.2.0+20231026 [extensions/ContentTranslation] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968776 (https://phabricator.wikimedia.org/T348563) (owner: 10KartikMistry) [14:38:12] Lucas_WMDE: now :) [14:38:16] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:968776|CX3 Build 0.2.0+20231026 (T348563 T308836)]] [14:38:22] T348563: CX: Modify query endpoint to fetch translations based on the UI scenario that needs them - https://phabricator.wikimedia.org/T348563 [14:38:22] T308836: Handle session expiration in Section Translation - https://phabricator.wikimedia.org/T308836 [14:38:42] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 42.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:39:36] !log lucaswerkmeister-wmde@deploy2002 kartik and lucaswerkmeister-wmde: Backport for [[gerrit:968776|CX3 Build 0.2.0+20231026 (T348563 T308836)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:39:51] kart_: please test :) [14:40:10] Lucas_WMDE: That's on wmf.2 - right? 
[14:40:23] yup [14:40:55] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@ff46322]: (no justification provided) [14:41:00] (03PS3) 10Muehlenhoff: piwik::dataase: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969128 [14:42:33] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@ff46322]: (no justification provided) (duration: 01m 38s) [14:44:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:45:59] Lucas_WMDE: one more minute.. [14:46:04] ok :) [14:46:56] Lucas_WMDE: OK. Please deploy.. [14:46:59] !log lucaswerkmeister-wmde@deploy2002 kartik and lucaswerkmeister-wmde: Continuing with sync [14:47:04] * Lucas_WMDE does so [14:47:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add mw-wikifunctions to MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/969131 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [14:47:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969128 (owner: 10Muehlenhoff) [14:49:08] !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough [14:51:21] (03CR) 10Jbond: [C: 03+2] httpbb: set ServiceOps as the team owner for httpbb jobs [puppet] - 10https://gerrit.wikimedia.org/r/969126 (owner: 10Jbond) [14:52:18] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:968776|CX3 Build 0.2.0+20231026 (T348563 T308836)]] (duration: 14m 01s) [14:52:24] T348563: CX: Modify query endpoint to fetch translations based on the UI scenario that needs them - https://phabricator.wikimedia.org/T348563 [14:52:24] T308836: Handle session expiration in Section 
Translation - https://phabricator.wikimedia.org/T308836 [14:52:53] Thanks a lot, Lucas_WMDE [14:53:37] np [14:53:42] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:48] !log UTC afternoon backport+config window (belatedly) done [14:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:50] (03CR) 10Filippo Giunchedi: prometheus: Add a default rsyslog destination for all sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [14:55:01] sproadic BGP/BFD alerts expected in all sites because of a bunch of restarts related to bird, please ignore, will keep track for the real ones [14:55:06] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:56:26] RECOVERY - BFD status on cr1-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:56:43] s/sproadic/sporadic [14:58:51] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T349830 (10phaultfinder) [15:07:21] (03CR) 10Jforrester: [C: 03+1] Add mw-wikifunctions to MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/969131 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [15:09:56] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [15:12:15] (03PS1) 10Herron: prom-es-exporter: w3c-networkerror include uri_host label [puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) [15:12:53] (03CR) 10CI reject: [V: 04-1] prom-es-exporter: w3c-networkerror include uri_host label 
[puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [15:12:56] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:12:57] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [15:15:10] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:15:50] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:16:44] (03PS1) 10Muehlenhoff: profile::arclamp::redis: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969137 [15:17:02] (03PS2) 10Herron: prom-es-exporter: w3c-networkerror include uri_host label [puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) [15:17:12] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:17:46] (03CR) 10CI reject: [V: 04-1] prom-es-exporter: w3c-networkerror include uri_host label [puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [15:19:14] (03PS2) 10Muehlenhoff: arclamp: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/969137 [15:20:10] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:20:20] ^ expected [15:20:37] (03CR) 10Jbond: sre.gitlab.*: customize lock arguments (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/967629 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:20:38] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - 
AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969137 (owner: 10Muehlenhoff) [15:21:21] (03CR) 10Herron: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [15:22:22] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@4c14785]: (no justification provided) [15:22:28] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:22:30] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:24:50] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:24:52] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:28:42] (03CR) 10Jbond: [C: 03+2] firewall: Allow ferm::service to accept an empty list for ranges [puppet] - 10https://gerrit.wikimedia.org/r/969125 (owner: 10Jbond) [15:28:58] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [15:29:41] (03CR) 10Cwhite: [C: 04-2] ""uri_host" is a scripted field in OpenSearch dashboards and is unknown to OpenSearch. This query will not do what is expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) (owner: 10Herron) [15:30:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:30:49] !log test add BGP session between ssw1-e1-eqiad and lsw1-e8-eqiad [15:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:03] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10VRiley-WMF) New locations are as follows cloudvirt-wdqs1001 - E 4. U 18. port 35. CableID 70824500012 cloudvirt-wdqs1002 - F 4. U 19. port 35. CableID 20... [15:32:20] PROBLEM - Zookeeper Server on druid1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [15:35:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:35:43] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@4c14785]: (no justification provided) (duration: 13m 21s) [15:39:11] (03PS1) 10Muehlenhoff: Switch idp_test to nftables [puppet] - 10https://gerrit.wikimedia.org/r/969138 [15:40:37] 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10RobH) [15:40:38] (03PS3) 10Stevemunene: Switch druid1005 zookeeper node with druid1010 [puppet] - 
10https://gerrit.wikimedia.org/r/965500 (https://phabricator.wikimedia.org/T336042) [15:41:11] 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10RobH) [15:42:03] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough [15:42:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/969138 (owner: 10Muehlenhoff) [15:42:54] !log sudo cumin -b1 -s600 'A:dns-rec and (A:eqiad or A:codfw)' 'systemctl restart ntp.service' [15:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:59] (03CR) 10Stevemunene: [C: 03+2] Switch druid1005 zookeeper node with druid1010 [puppet] - 10https://gerrit.wikimedia.org/r/965500 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [15:48:06] !log hnowlan@deploy2002 Started deploy [restbase/deploy@c461bad]: Adding fonwiki T347940 [15:48:10] T347940: Add fonwiki to RESTBase - https://phabricator.wikimedia.org/T347940 [15:48:14] (03PS1) 10Muehlenhoff: nftables::service: Fix file name variable [puppet] - 10https://gerrit.wikimedia.org/r/969140 [15:48:25] (03PS2) 10Muehlenhoff: nftables::service: Fix file name variable [puppet] - 10https://gerrit.wikimedia.org/r/969140 [15:50:19] (03PS3) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [15:50:59] (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [15:52:40] (03PS1) 10JMeybohm: tegola-vector-tiles: Re-enable the envoy admin listener on tcp port [deployment-charts] - 10https://gerrit.wikimedia.org/r/969141 [15:53:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 
10https://gerrit.wikimedia.org/r/969121 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [15:55:52] (03PS1) 10Btullis: Tidy up analytics.pp whitespace [puppet] - 10https://gerrit.wikimedia.org/r/969142 [15:55:54] (03PS1) 10Btullis: Enable support for statsd_exporters on non-ops instances [puppet] - 10https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) [15:56:00] (03CR) 10Jbond: [C: 03+2] docker::reporter: route k8s alerts to service ops [puppet] - 10https://gerrit.wikimedia.org/r/969121 (https://phabricator.wikimedia.org/T349176) (owner: 10Jbond) [15:57:07] (03PS2) 10Btullis: Enable support for statsd_exporters on non-ops instances [puppet] - 10https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) [15:57:29] (03PS4) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [15:57:34] (03PS2) 10JMeybohm: tegola-vector-tiles: Re-enable the envoy admin listener on tcp port [deployment-charts] - 10https://gerrit.wikimedia.org/r/969141 (https://phabricator.wikimedia.org/T300033) [15:58:10] (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [15:58:42] (03PS3) 10JMeybohm: tegola-vector-tiles: Re-enable the envoy admin listener on tcp port [deployment-charts] - 10https://gerrit.wikimedia.org/r/969141 (https://phabricator.wikimedia.org/T300033) [16:00:05] jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:56] (03PS3) 10Btullis: Enable support for statsd_exporters on non-ops instances [puppet] - 10https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) [16:04:59] !log hnowlan@deploy2002 Finished deploy [restbase/deploy@c461bad]: Adding fonwiki T347940 (duration: 16m 53s) [16:05:06] T347940: Add fonwiki to RESTBase - https://phabricator.wikimedia.org/T347940 [16:05:38] (03PS5) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [16:06:18] (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [16:06:27] (03CR) 10Btullis: "I have made a related patch here:" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [16:07:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:12:00] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/213/console" [puppet] - 10https://gerrit.wikimedia.org/r/969142 (owner: 10Btullis) [16:13:42] (03PS6) 10Effie Mouzeli: ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [16:14:19] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10RobH) [16:14:24] (03CR) 10CI reject: [V: 04-1] ipoid: Update cronjob definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/966813 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta 
Harlan) [16:14:39] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10RobH) [16:14:59] (03PS2) 10Brion VIBBER: "Soft-launch" iOS-compatible HLS video transcodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967531 (https://phabricator.wikimedia.org/T68722) [16:16:06] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10RobH) [16:16:34] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10RobH) [16:16:40] (03CR) 10Jbond: [C: 04-2] "After speaking with valintine we should be able to resolve this using the ssl_trusted_certificate directive" [puppet] - 10https://gerrit.wikimedia.org/r/968748 (owner: 10Jbond) [16:16:41] (03Abandoned) 10Jbond: puppet_ca_certs: We need to have the full chain for client auth [puppet] - 10https://gerrit.wikimedia.org/r/968748 (owner: 10Jbond) [16:17:27] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10RobH) a:03MatthewVernon @MatthewVernon, Per @kofori's recommendation, you would be the best person to provide the hostname and racking details for these 7 new ms-be hosts slated to... [16:17:29] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10RobH) a:03MatthewVernon @MatthewVernon, Per @kofori's recommendation, you would be the best person to provide the hostname and racking details for these 7 new ms-be hosts slated to... [16:18:12] (03CR) 10Jbond: [C: 03+1] nftables::service: Fix file name variable [puppet] - 10https://gerrit.wikimedia.org/r/969140 (owner: 10Muehlenhoff) [16:18:30] !log stevemunene@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. 
[16:22:33] Anybody feel dangerous wanna do a config deploy? :D (Or else review it w/ things to tweak) https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/967531 <- soft-launching the new iOS-compatible video tracks (already confirmed working via testwiki over the last couple weeks) [16:26:24] (03CR) 10Btullis: [WIP] Send metrics from Airflow analytics test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [16:35:29] (03CR) 10JMeybohm: [C: 03+2] Add mw-wikifunctions to MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/969131 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [16:38:57] (03PS1) 10Isabelle Hurbain-Palatin: Roll-out Parsoid Kartographer support for all English language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969168 (https://phabricator.wikimedia.org/T342871) [16:39:21] (03PS1) 10Gergő Tisza: OIDC: Return '' instead of null for email in profile [extensions/OAuth] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/969151 (https://phabricator.wikimedia.org/T283456) [16:40:38] (03Merged) 10jenkins-bot: Add mw-wikifunctions to MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/969131 (https://phabricator.wikimedia.org/T347544) (owner: 10JMeybohm) [16:51:49] (03PS1) 10Jbond: systemd::unit: rename unit to name [puppet] - 10https://gerrit.wikimedia.org/r/969171 [16:54:11] (03CR) 10CI reject: [V: 04-1] systemd::unit: rename unit to name [puppet] - 10https://gerrit.wikimedia.org/r/969171 (owner: 10Jbond) [16:54:17] (03PS18) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [16:56:30] (03CR) 10Subramanya Sastry: [C: 03+1] Roll-out Parsoid Kartographer support for all English language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969168 (https://phabricator.wikimedia.org/T342871) (owner: 
10Isabelle Hurbain-Palatin) [16:56:36] (03PS19) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [16:58:54] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T1700) [17:01:26] (03PS2) 10Jbond: systemd::unit: rename unit to name [puppet] - 10https://gerrit.wikimedia.org/r/969171 [17:01:34] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:01:51] !log sudo cumin -b1 -s30 'A:dns-rec and not A:codfw' 'systemctl restart haproxy.service' [17:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:47] (03CR) 10Jbond: [C: 03+2] systemd::unit: rename unit to name [puppet] - 10https://gerrit.wikimedia.org/r/969171 (owner: 10Jbond) [17:06:40] (03PS1) 10Cathal Mooney: Adjust reimage cookbook config for DHCP binding clear workaround [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) [17:06:56] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:07:07] (03PS4) 10Herron: prom-es-exporter: w3c-networkerror include uri_host label [puppet] - 10https://gerrit.wikimedia.org/r/969135 (https://phabricator.wikimedia.org/T349807) [17:11:05] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [17:13:12] PROBLEM - Check systemd state on sretest2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:45] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [17:19:59] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid public cluster: Roll restart of Druid jvm daemons. [17:22:36] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10cmooney) >>! In T346948#9284698, @VRiley-WMF wrote: > New locations are as follows > > cloudvirt-wdqs1001 - E 4. U 18. port 35. CableID 70824500012 > > c... [17:22:42] 10SRE, 10Wikimedia-Mailing-lists: Create mailing list for Language Diversity Hub steering committee - https://phabricator.wikimedia.org/T349812 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup {{done}} https://lists.wikimedia.org/postorius/lists/langdivhub-com.lists.wikimedia.org/members/owner/ [17:32:32] !log taavi@cumin1001 START - Cookbook sre.dns.netbox [17:35:23] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudvirt-wdqs1001 - taavi@cumin1001" [17:36:11] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudvirt-wdqs1001 - taavi@cumin1001" [17:36:12] !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:36:16] (03PS1) 10Majavah: Revert "Fix puppet on cloudvirt-wdqs* until they have been moved" [puppet] - 10https://gerrit.wikimedia.org/r/969152 [17:39:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.63% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:41:05] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [17:42:31] (03PS1) 10Jbond: systemd: Add a way to provide a default team [puppet] - 10https://gerrit.wikimedia.org/r/969177 [17:44:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.63% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:44:41] (03CR) 10CI reject: [V: 04-1] systemd: Add a way to provide a default team [puppet] - 10https://gerrit.wikimedia.org/r/969177 (owner: 10Jbond) [17:50:02] (03PS2) 10Jbond: systemd: Add a way to provide a default team [puppet] - 10https://gerrit.wikimedia.org/r/969177 [17:50:04] (03PS1) 10Jbond: profile::contacts: update so the _role_contact has not been sanitised [puppet] - 10https://gerrit.wikimedia.org/r/969181 [17:52:16] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10taavi) >>! In T346948#9284698, @VRiley-WMF wrote: > cloudvirt-wdqs1002 - F 4. U 19. port 35. CableID 20220058 Thanks! I'm getting a duplicate cable ID ale... 
[17:52:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/969177 (owner: 10Jbond) [17:53:04] !log sudo cumin -b1 -s300 'A:dns-rec and not A:codfw' 'systemctl restart pdns-recursor.service' [17:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:17] ^ BGP/BFD alerts expected in all sites minus codfw [17:53:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/969181 (owner: 10Jbond) [17:53:36] (03PS2) 10Cathal Mooney: Adjust reimage cookbook config for DHCP binding clear workaround [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) [17:55:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/217/con" [puppet] - 10https://gerrit.wikimedia.org/r/969177 (owner: 10Jbond) [17:57:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:00:04] dancy and brennen: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T1800). 
[18:00:08] o/ [18:00:33] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969182 (https://phabricator.wikimedia.org/T348355) [18:00:35] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969182 (https://phabricator.wikimedia.org/T348355) (owner: 10TrainBranchBot) [18:01:21] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969182 (https://phabricator.wikimedia.org/T348355) (owner: 10TrainBranchBot) [18:01:42] o/ [18:07:37] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.2 refs T348355 [18:07:46] T348355: 1.42.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T348355 [18:09:54] (03PS3) 10Jbond: systemd: Add a way to provide a default team [puppet] - 10https://gerrit.wikimedia.org/r/969177 [18:10:39] (03CR) 10CI reject: [V: 04-1] systemd: Add a way to provide a default team [puppet] - 10https://gerrit.wikimedia.org/r/969177 (owner: 10Jbond) [18:11:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/969177 (owner: 10Jbond) [18:17:08] (03CR) 10Jbond: "This CR will create a lot of drop in files, see pcc[1]. I'm not sure if that would cause and issue. 
The other option would be too create " [puppet] - 10https://gerrit.wikimedia.org/r/969177 (owner: 10Jbond) [18:19:33] (03PS4) 10Jbond: systemd: Add a way to provide a default team [puppet] - 10https://gerrit.wikimedia.org/r/969177 [18:21:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/969177 (owner: 10Jbond) [18:21:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] profile::contacts: update so the _role_contact has not been sanitised [puppet] - 10https://gerrit.wikimedia.org/r/969181 (owner: 10Jbond) [18:27:03] (PuppetFailure) resolved: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:32:20] (03PS1) 10BCornwall: hiera: remove dns2001 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/969185 (https://phabricator.wikimedia.org/T342154) [18:32:46] (03PS2) 10BCornwall: hiera: remove dns2004 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/969185 (https://phabricator.wikimedia.org/T342154) [18:33:36] (03PS1) 10Jbond: PCC::clean_reports: allow reports to stick around for 90 days [puppet] - 10https://gerrit.wikimedia.org/r/969186 [18:33:38] (03CR) 10BCornwall: [C: 03+2] hiera: remove dns2004 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/969185 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:37:07] (03CR) 10Jbond: [C: 03+2] PCC::clean_reports: allow reports to stick around for 90 days [puppet] - 10https://gerrit.wikimedia.org/r/969186 (owner: 10Jbond) [18:40:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:43:04] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:43:16] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:43:44] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:43:46] PROBLEM - Bird Internet Routing Daemon on dns2004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:44:26] ^Reimaging some codfw dns servers! [18:44:42] (03CR) 10Eevans: [V: 03+2 C: 03+2] cqlsh-instance (new) [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/966913 (owner: 10Eevans) [18:44:50] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns2004.wikimedia.org with OS bookworm [18:45:02] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns2004.wikimedia.org with OS bookworm [18:49:04] (03PS1) 10DDesouza: Deploy pilot survey on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969188 (https://phabricator.wikimedia.org/T349854) [18:49:33] (03CR) 10Xcollazo: [C: 03+1] "+1ing this patch due to my testing on the analytics test cluster confirming that the Spark Shufflers work correctly. 
See T344910#9277895 a" [puppet] - 10https://gerrit.wikimedia.org/r/964008 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [18:49:56] (03CR) 10CI reject: [V: 04-1] Deploy pilot survey on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969188 (https://phabricator.wikimedia.org/T349854) (owner: 10DDesouza) [18:51:46] PROBLEM - Host 2620:0:860:2:208:80:153:48 is DOWN: PING CRITICAL - Packet loss = 100% [18:52:47] (03PS2) 10DDesouza: Deploy pilot survey on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969188 (https://phabricator.wikimedia.org/T349854) [18:54:50] PROBLEM - Recursive DNS on 208.80.153.48 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:05:07] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2004.wikimedia.org with reason: host reimage [19:08:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2004.wikimedia.org with reason: host reimage [19:17:07] PROBLEM - Recursive DNS on 208.80.153.48 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:18:09] RECOVERY - Recursive DNS on 208.80.153.48 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [19:28:09] !log taavi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt-wdqs1001 [19:28:19] !log taavi@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudvirt-wdqs1001 [19:29:25] !log taavi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt-wdqs1001 [19:29:48] !log taavi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt-wdqs1001 [19:30:45] !log taavi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [19:30:57] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 
10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [19:36:25] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:36:49] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 175, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:37:31] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:37:32] (03CR) 10C. Scott Ananian: [C: 03+1] Roll-out Parsoid Kartographer support for all English language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969168 (https://phabricator.wikimedia.org/T342871) (owner: 10Isabelle Hurbain-Palatin) [19:38:37] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2004.wikimedia.org with OS bookworm [19:38:47] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns2004.wikimedia.org with OS bookworm completed: - dns2004 (**PASS**) - Downtimed on Icinga/Al... 
[19:40:00] 10SRE, 10serviceops, 10API Platform (RESTbase Deprecation Roadmap), 10Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Jdforrester-WMF) [19:40:20] 10SRE, 10ChangeProp, 10EventStreams, 10Image-Suggestion-API, and 5 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF) [19:41:56] !log taavi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [19:42:10] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm executed with... [19:43:04] !log taavi@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirt-wdqs1001.mgmt.eqiad.wmnet with reboot policy FORCED [19:45:21] 10SRE, 10ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (10VRiley-WMF) Verified the switch. 
It is WMF10822 - 4BN3SR3 [19:45:59] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:46:33] 10SRE, 10ops-eqiad, 10DC-Ops: Audit of WMCS Servers Using Single & Dual Switchports - https://phabricator.wikimedia.org/T349756 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [19:48:22] (03PS1) 10BCornwall: Revert "hiera: remove dns2004 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/969153 [19:48:51] (03CR) 10BCornwall: [C: 03+2] Revert "hiera: remove dns2004 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/969153 (owner: 10BCornwall) [19:54:40] (03PS1) 10BCornwall: hiera: remove dns2005 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/969194 (https://phabricator.wikimedia.org/T342154) [19:56:18] (03CR) 10BCornwall: [C: 03+2] hiera: remove dns2005 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/969194 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [19:59:24] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt-wdqs1001.mgmt.eqiad.wmnet with reboot policy FORCED [19:59:50] !log taavi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [20:00:03] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [20:00:04] brennen and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231026T2000). 
[20:00:05] tgr, bvibber, and danisztls: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:24] \o/ [20:01:09] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:01:21] brennen: I can't deploy, are you available? [20:01:24] ^reimaging dns server [20:02:11] o/ [20:02:56] i can deploy [20:03:35] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns2005.wikimedia.org with OS bookworm [20:03:44] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns2005.wikimedia.org with OS bookworm [20:04:47] tgr: about? [20:05:17] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:05:53] ^reimaging dns server [20:07:05] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:07:09] bvibber: guess we'll start with yours [20:07:17] woohoo!
[20:07:53] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:08:11] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:10:42] (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/967531 (https://phabricator.wikimedia.org/T68722) (owner: Brion VIBBER) [20:11:03] PROBLEM - Host 2620:0:860:3:208:80:153:74 is DOWN: PING CRITICAL - Packet loss = 100% [20:11:18] \o\ /o/ \o\ \o/ [20:11:37] !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: host reimage [20:11:50] (Merged) jenkins-bot: "Soft-launch" iOS-compatible HLS video transcodes [mediawiki-config] - https://gerrit.wikimedia.org/r/967531 (https://phabricator.wikimedia.org/T68722) (owner: Brion VIBBER) [20:12:06] !log brennen@deploy2002 Started scap: Backport for [[gerrit:967531|"Soft-launch" iOS-compatible HLS video transcodes (T68722)]] [20:12:11] T68722: [iOS app] Some media (esp. video) files do not work - https://phabricator.wikimedia.org/T68722 [20:12:18] woo [20:13:24] !log brennen@deploy2002 brennen and brion: Backport for [[gerrit:967531|"Soft-launch" iOS-compatible HLS video transcodes (T68722)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:13:25] PROBLEM - Recursive DNS on 208.80.153.74 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:14:00] brennen: confirmed working on mwdebug2001 :D [20:14:40] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: host reimage [20:15:00] bvibber: cool, going ahead. [20:15:04] !log brennen@deploy2002 brennen and brion: Continuing with sync [20:15:11] whee [20:19:57] danisztls: around?
[20:19:58] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2005.wikimedia.org with reason: host reimage [20:20:36] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:967531|"Soft-launch" iOS-compatible HLS video transcodes (T68722)]] (duration: 08m 29s) [20:20:40] T68722: [iOS app] Some media (esp. video) files do not work - https://phabricator.wikimedia.org/T68722 [20:21:35] brennen: and confirmed deployed. thanks! [20:21:48] i can run some background conversions now woooooooooo :D [20:22:02] bvibber: sure thing [20:22:26] brennen: yep [20:23:03] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2005.wikimedia.org with reason: host reimage [20:24:12] (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/969188 (https://phabricator.wikimedia.org/T349854) (owner: DDesouza) [20:25:03] (Merged) jenkins-bot: Deploy pilot survey on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/969188 (https://phabricator.wikimedia.org/T349854) (owner: DDesouza) [20:25:16] !log brennen@deploy2002 Started scap: Backport for [[gerrit:969188|Deploy pilot survey on metawiki (T349854)]] [20:25:21] T349854: Deploy pilot survey on metawiki - https://phabricator.wikimedia.org/T349854 [20:26:28] !log brennen@deploy2002 dani and brennen: Backport for [[gerrit:969188|Deploy pilot survey on metawiki (T349854)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:26:33] brennen: here, sorry [20:27:01] (PS2) Majavah: Revert "Fix puppet on cloudvirt-wdqs* until they have been moved" [puppet] - https://gerrit.wikimedia.org/r/969152 [20:27:02] danisztls: on test servers [20:27:03] (PS1) Majavah: hieradata: update interface names for cloudvirt-wdqs1001 [puppet] - https://gerrit.wikimedia.org/r/969199 [20:28:17] (CR) Brennen Bearnes: [C: +2] OIDC: Return '' instead of null
for email in profile [extensions/OAuth] (wmf/1.42.0-wmf.2) - https://gerrit.wikimedia.org/r/969151 (https://phabricator.wikimedia.org/T283456) (owner: Gergő Tisza) [20:28:35] tgr: cool, +2ing this one, will get it going as soon as finished with danisztls's patch [20:28:50] brennen: all good [20:28:55] cool, continuing [20:29:00] !log brennen@deploy2002 dani and brennen: Continuing with sync [20:29:41] (PS1) Majavah: prometheus: ipmi_exporter: add dependency on package [puppet] - https://gerrit.wikimedia.org/r/969201 [20:30:14] (CR) Majavah: [C: +2] Revert "Fix puppet on cloudvirt-wdqs* until they have been moved" [puppet] - https://gerrit.wikimedia.org/r/969152 (owner: Majavah) [20:30:23] (CR) Majavah: [C: +2] hieradata: update interface names for cloudvirt-wdqs1001 [puppet] - https://gerrit.wikimedia.org/r/969199 (owner: Majavah) [20:30:49] brennen: thanks!! [20:31:06] !log brion running video transcode backfill via mwmaint2002 (requeueTranscodes.php) + job queue [20:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:36] sure thing. [20:33:25] (Merged) jenkins-bot: OIDC: Return '' instead of null for email in profile [extensions/OAuth] (wmf/1.42.0-wmf.2) - https://gerrit.wikimedia.org/r/969151 (https://phabricator.wikimedia.org/T283456) (owner: Gergő Tisza) [20:34:12] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:969188|Deploy pilot survey on metawiki (T349854)]] (duration: 08m 56s) [20:34:17] T349854: Deploy pilot survey on metawiki - https://phabricator.wikimedia.org/T349854 [20:34:28] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudvirt-wdqs1001 - taavi@cumin1001" [20:34:35] thanks brennen!
does not need to be tested, patch is trivial [20:34:58] !log brennen@deploy2002 Started scap: Backport for [[gerrit:969151|OIDC: Return '' instead of null for email in profile (T283456)]] [20:35:05] T283456: OAuth identfy endpoint should not expose unconfirmed email address - https://phabricator.wikimedia.org/T283456 [20:35:33] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudvirt-wdqs1001 - taavi@cumin1001" [20:36:12] tgr: ack, will proceed with sync. [20:36:12] !log brennen@deploy2002 brennen and tgr: Backport for [[gerrit:969151|OIDC: Return '' instead of null for email in profile (T283456)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:37:11] !log brennen@deploy2002 brennen and tgr: Continuing with sync [20:38:24] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:38:34] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 175, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:38:50] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:38:56] ^reimaging dns server [20:41:52] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2005.wikimedia.org with OS bookworm [20:42:04] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns2005.wikimedia.org with OS bookworm completed: - dns2005 (**PASS**) - Downtimed on Icinga/Al...
[20:42:24] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:969151|OIDC: Return '' instead of null for email in profile (T283456)]] (duration: 07m 25s) [20:42:29] T283456: OAuth identfy endpoint should not expose unconfirmed email address - https://phabricator.wikimedia.org/T283456 [20:42:43] !log end of utc late backport & config window [20:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:44] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1001" [20:45:27] SRE: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671 (Jasper) >>! In T349671#9283152, @Fabfur wrote: > Targeting cp4039 instance for upload.wikimedia.org seems to work fine when uploading/viewing contents. > I'll keep investigating on this... To be clear, the problem does... [20:45:34] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1001" [20:45:39] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm [20:45:52] SRE, ops-eqiad, Cloud-VPS, Data-Platform-SRE, cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bookworm completed: - c... [20:49:42] (PS22) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [20:54:11] (CR) Dwisehaupt: "Thanks for those spec tests. I have made some updates and I think this is good for additional review by Jeff and/or Jesse."
[puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt) [21:00:48] !log taavi@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1001.eqiad.wmnet [21:06:56] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:08:30] (PS1) Majavah: hieradata: fix VLAN names for cloudvirt-wdqs [puppet] - https://gerrit.wikimedia.org/r/969206 [21:08:45] (CR) CI reject: [V: -1] hieradata: fix VLAN names for cloudvirt-wdqs [puppet] - https://gerrit.wikimedia.org/r/969206 (owner: Majavah) [21:09:15] (PS2) Majavah: hieradata: fix VLAN names for cloudvirt-wdqs [puppet] - https://gerrit.wikimedia.org/r/969206 [21:09:21] (CR) Majavah: [C: +2] hieradata: fix VLAN names for cloudvirt-wdqs [puppet] - https://gerrit.wikimedia.org/r/969206 (owner: Majavah) [21:09:23] (CR) Majavah: [V: +2 C: +2] hieradata: fix VLAN names for cloudvirt-wdqs [puppet] - https://gerrit.wikimedia.org/r/969206 (owner: Majavah) [21:12:02] !log taavi@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudvirt-wdqs1001.eqiad.wmnet [21:12:38] PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:16:30] !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: still trying to get nova to schedule hosts there [21:16:44] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: still trying to get nova to schedule hosts there [21:16:46] (PS1) BCornwall: Revert "hiera: remove dns2005 from authdns_servers" [puppet] -
https://gerrit.wikimedia.org/r/969154 [21:17:26] (CR) BCornwall: [C: +2] Revert "hiera: remove dns2005 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/969154 (owner: BCornwall) [21:20:26] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Majavah still trying to figure out why Nova is not scheduling anything there https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:32:17] (PS1) Ebernhardson: cirrus updater: Re-enable the .* route for mwapi [deployment-charts] - https://gerrit.wikimedia.org/r/969209 [21:33:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:34:02] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:34:46] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:37:36] (PS1) Ebernhardson: cirrus updater: Update container version [deployment-charts] - https://gerrit.wikimedia.org/r/969211 [21:39:53] (PS1) BCornwall: hiera: remove dns2006 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/969212 (https://phabricator.wikimedia.org/T342154) [21:40:46] (CR) BCornwall: [C: +2] hiera: remove dns2006 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/969212 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall) [21:42:08] (CR) Ebernhardson: [C: +2] cirrus updater: Update container version [deployment-charts] - https://gerrit.wikimedia.org/r/969211 (owner: Ebernhardson) [21:42:51] (Merged) jenkins-bot: cirrus updater: Update container version [deployment-charts] - https://gerrit.wikimedia.org/r/969211 (owner:
Ebernhardson) [21:43:12] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:45:13] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:45:23] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:46:42] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.230 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:46:46] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:47:36] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:47:49] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns2006.wikimedia.org with OS bookworm [21:48:00] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns2006.wikimedia.org with OS bookworm [21:49:58] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:49:58] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:50:26] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:55:21] ^reimaging dns server [21:56:02] PROBLEM - Host 2620:0:860:4:208:80:153:107 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:4:208:80:153:107) [21:57:21]
(CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:59:06] PROBLEM - Recursive DNS on 208.80.153.107 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [22:03:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:07:35] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2006.wikimedia.org with reason: host reimage [22:08:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:10:48] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2006.wikimedia.org with reason: host reimage [22:19:41] PROBLEM - Recursive DNS on 208.80.153.107 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [22:20:07] ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (RobH) [22:20:39] ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (RobH) [22:20:47] RECOVERY - Recursive DNS on 208.80.153.107 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[22:21:30] ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (RobH) a:Clement_Goubert >>! In T348045#9281139, @Kappakayala wrote: > @Clement_Goubert / @Joe could one of you help with the racking details? Would one of you be so kind as to update... [22:22:22] ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (RobH) [22:22:41] ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (RobH) [22:23:20] ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (RobH) a:Clement_Goubert >>! In T348046#9281141, @Kappakayala wrote: > @Clement_Goubert / @Joe could one of you help with the racking details? Would one of you be so kind as to update... [22:26:03] ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349875 (RobH) [22:26:23] ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349875 (RobH) [22:27:25] ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349875 (RobH) a:Clement_Goubert >>! In T348021#9281147, @Kappakayala wrote: > @Clement_Goubert / @Joe could one of you help with the racking details? I've split the racking task onto it...
[22:28:00] ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349876 (RobH) [22:28:05] ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349875 (RobH) [22:28:15] ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (RobH) [22:29:31] ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349876 (RobH) a:Clement_Goubert >>! In T348020#9281144, @Kappakayala wrote: > @Clement_Goubert / @Joe could one of you help with the racking details? I've split the racking task onto... [22:29:39] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349876 (RobH) [22:48:55] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:49:11] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:49:31] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2006.wikimedia.org with OS bookworm [22:49:42] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns2006.wikimedia.org with OS bookworm completed: - dns2006 (**PASS**) - Downtimed on Icinga/Al...
[22:49:49] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 175, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:51:04] (PS1) BCornwall: Revert "hiera: remove dns2006 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/969155 [22:53:06] (CR) BCornwall: [C: +2] Revert "hiera: remove dns2006 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/969155 (owner: BCornwall) [23:01:33] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall) [23:02:47] ops-eqiad, DC-Ops, Data-Engineering: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (RobH) [23:03:01] ops-eqiad, DC-Ops, Data-Engineering: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (RobH) [23:07:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:30:10] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:35:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All -
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:45:59] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure