[00:00:31] RECOVERY - Check systemd state on centrallog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:32] (Merged) jenkins-bot: cirrus updater: Update container image [deployment-charts] - https://gerrit.wikimedia.org/r/981632 (owner: Ebernhardson)
[00:01:03] RECOVERY - Check systemd state on centrallog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:37] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[00:01:48] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:21:32] (CR) Majavah: [V: +2 C: +2] secret: dkim: move wmcs dkim keys to correct location [labs/private] - https://gerrit.wikimedia.org/r/969690 (owner: Majavah)
[00:22:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 8h 12m 40s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[00:26:12] (PS3) Majavah: hieradata: fix cloudinfra webproxy password location [labs/private] - https://gerrit.wikimedia.org/r/969689
[00:26:18] (PS3) Majavah: hieradata: add fake metricsinfra grafana password [labs/private] - https://gerrit.wikimedia.org/r/969691
[00:29:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2006.codfw.wmnet with OS bullseye
[00:29:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye
[00:29:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye
[00:29:51] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2006.codfw.wmnet with OS bullseye
[00:29:55] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2004.codfw.wmnet with OS bullseye
[00:29:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2006.codfw.wmnet with OS bullseye
[00:29:57] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2005.codfw.wmnet with OS bullseye
[00:30:03] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sessionstore2006.codfw.wmnet with OS bullseye executed with errors: -...
[00:30:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cephosd2002.codfw.wmnet with OS bullseye
[00:30:22] SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye
[00:30:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2006.codfw.wmnet with OS bullseye
[00:30:41] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2006.codfw.wmnet with OS bullseye
[00:30:46] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2004.codfw.wmnet with OS bullseye
[00:30:51] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sessionstore2004.codfw.wmnet with OS bullseye executed with errors: -...
[00:30:55] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2005.codfw.wmnet with OS bullseye
[00:30:57] (PS1) Majavah: P:mail: use wmcloud.org instead of wmflabs.org in envelopes [puppet] - https://gerrit.wikimedia.org/r/981635
[00:31:01] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sessionstore2005.codfw.wmnet with OS bullseye executed with errors: -...
[00:31:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye
[00:31:35] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2004.codfw.wmnet with OS bullseye
[00:31:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye
[00:33:45] (PS2) Majavah: P:mail: use wmcloud.org instead of wmflabs.org in envelopes [puppet] - https://gerrit.wikimedia.org/r/981635
[00:34:36] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/856/con" [puppet] - https://gerrit.wikimedia.org/r/981635 (owner: Majavah)
[00:35:41] (CR) Majavah: [V: +2 C: +2] hieradata: fix cloudinfra webproxy password location [labs/private] - https://gerrit.wikimedia.org/r/969689 (owner: Majavah)
[00:37:34] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 7h 29m 19s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[00:38:48] (CR) Majavah: [V: +2 C: +2] hieradata: add fake metricsinfra grafana password [labs/private] - https://gerrit.wikimedia.org/r/969691 (owner: Majavah)
[00:38:52] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/981430
[00:38:54] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/981430 (owner: TrainBranchBot)
[00:39:19] PROBLEM - Check systemd state on centrallog2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:17] PROBLEM - Check systemd state on centrallog1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2006.codfw.wmnet with reason: host reimage
[00:48:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2005.codfw.wmnet with reason: host reimage
[00:48:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2004.codfw.wmnet with reason: host reimage
[00:50:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on sessionstore2005.codfw.wmnet with reason: host reimage
[00:50:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2006.codfw.wmnet with reason: host reimage
[00:53:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2004.codfw.wmnet with reason: host reimage
[00:57:40] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/981430 (owner: TrainBranchBot)
[01:13:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd2002.codfw.wmnet with OS bullseye
[01:13:43] SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye executed with errors: - ce...
[01:17:24] SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Papaul) @Jhancock.wm oin 2002 try to check network possible re-run the switch config cookbook
[01:29:43] PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 236029 MB (6% inode=91%): /srv/swift-storage/sdg1 217598 MB (5% inode=91%): /srv/swift-storage/sdh1 198174 MB (5% inode=91%): /srv/swift-storage/sdc1 214432 MB (5% inode=91%): /srv/swift-storage/sde1 215968 MB (5% inode=91%): /srv/swift-storage/sdd1 214549 MB (5% inode=91%): /srv/swift-storage/sdj1 215732 MB (5% inode=92%): /srv/swift-st
[01:29:43] l1 183225 MB (4% inode=91%): /srv/swift-storage/sdi1 170268 MB (4% inode=92%): /srv/swift-storage/sdk1 200329 MB (5% inode=91%): /srv/swift-storage/sdm1 217953 MB (5% inode=92%): /srv/swift-storage/sdn1 152141 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops
[01:36:43] SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (Kappakayala)
[01:44:26] (PS1) Pols12: Make wiktionary and mw.org provide og:site_name [mediawiki-config] - https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203)
[02:16:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:39:07] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:19] (CR) Jdlrobson: [C: -1] "With the changes, you can safely deploy this on Wednesday given that's when we will have deployed to mediawiki.org and wiktionary" [mediawiki-config] - https://gerrit.wikimedia.org/r/981636 (https://phabricator.wikimedia.org/T348203) (owner: Pols12)
[03:00:26] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:09:07] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:31:07] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:33:15] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: geoip_update_main.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:35] PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: geoip_update_main.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:58] SRE-swift-storage, UploadWizard: Internal error: The server could not save the temporary file - https://phabricator.wikimedia.org/T353068 (Aklapper)
[03:38:27] PROBLEM - Check unit status of geoip_update_main on puppetserver1001 is CRITICAL: CRITICAL: Status of the systemd unit geoip_update_main https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:41:09] PROBLEM - Check unit status of geoip_update_main on puppetmaster1001 is CRITICAL: CRITICAL: Status of the systemd unit geoip_update_main https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:28:12] (PS1) Andrew Bogott: Revert "vendordata: only wipe out puppet certs if we aren't building a base image" [puppet] - https://gerrit.wikimedia.org/r/981200
[06:28:51] (CR) Andrew Bogott: [C: +2] Revert "vendordata: only wipe out puppet certs if we aren't building a base image" [puppet] - https://gerrit.wikimedia.org/r/981200 (owner: Andrew Bogott)
[06:37:21] (PS1) Andrew Bogott: Revert "nova vendor-data: 3rd attempt to read 'install_puppet' metadata" [puppet] - https://gerrit.wikimedia.org/r/981201
[06:37:30] (PS1) Andrew Bogott: Revert "nova vendor-data: 2nd attempt to read 'install_puppet' metadata" [puppet] - https://gerrit.wikimedia.org/r/981202
[06:37:51] (CR) CI reject: [V: -1] Revert "nova vendor-data: 2nd attempt to read 'install_puppet' metadata" [puppet] - https://gerrit.wikimedia.org/r/981202 (owner: Andrew Bogott)
[06:39:51] SRE-swift-storage, Grafana: Disk space thanos-be1001:9100 alert - https://phabricator.wikimedia.org/T353091 (Marostegui)
[06:41:00] SRE-swift-storage, Observability-Metrics, Grafana: Disk space thanos-be1001:9100 alert - https://phabricator.wikimedia.org/T353091 (Marostegui)
[06:42:31] (CR) Andrew Bogott: [C: +2] Revert "nova vendor-data: 3rd attempt to read 'install_puppet' metadata" [puppet] - https://gerrit.wikimedia.org/r/981201 (owner: Andrew Bogott)
[06:42:45] (PS2) Andrew Bogott: Revert "nova vendor-data: 2nd attempt to read 'install_puppet' metadata" [puppet] - https://gerrit.wikimedia.org/r/981202
[06:43:23] (CR) Andrew Bogott: [C: +2] Revert "nova vendor-data: 2nd attempt to read 'install_puppet' metadata" [puppet] - https://gerrit.wikimedia.org/r/981202 (owner: Andrew Bogott)
[07:00:26] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:04:17] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:17:32] SRE-swift-storage, Observability-Metrics, Grafana: Disk space thanos-be1001:9100 alert - https://phabricator.wikimedia.org/T353091 (Marostegui) 1002 seems to have the same issue
[07:19:35] SRE-swift-storage, Observability-Metrics, Grafana: Disk space thanos-be1001:9100 alert - https://phabricator.wikimedia.org/T353091 (Marostegui) 2002 too
[07:56:47] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:57:11] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:57:53] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:01:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.295 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:01:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51007 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:02:15] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:14:45] SRE, Observability-Metrics, Goal, Patch-For-Review: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (elukey) Hi @colewhite! I worked with James to port the Recommendation-api to nodejs 18, and one of the patches that we merged is: https://gerrit.wikimedia.org/r/c/m...
[09:52:00] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:06:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:00:27] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:14:59] (PS5) Slyngshede: Move Debmonitor client code to separate repository. [software/debmonitor-client] - https://gerrit.wikimedia.org/r/981463
[11:15:12] (CR) Slyngshede: Move Debmonitor client code to separate repository. (5 comments) [software/debmonitor-client] - https://gerrit.wikimedia.org/r/981463 (owner: Slyngshede)
[12:26:17] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:27:45] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:45:43] (PS1) AikoChou: ml-services: test kserve batcher for revertrisk-la in staging [deployment-charts] - https://gerrit.wikimedia.org/r/981646 (https://phabricator.wikimedia.org/T348536)
[13:36:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:37:43] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:39:08] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:08] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:55:03] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:59:31] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:00:27] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:49:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2005.codfw.wmnet with OS bullseye
[15:49:37] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sessionstore2005.codfw.wmnet with OS bullseye executed with errors: -...
[15:51:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2006.codfw.wmnet with OS bullseye
[15:51:30] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sessionstore2006.codfw.wmnet with OS bullseye executed with errors: -...
[15:53:08] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2004.codfw.wmnet with OS bullseye
[15:53:15] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sessionstore2004.codfw.wmnet with OS bullseye executed with errors: -...
[17:13:19] PROBLEM - Disk space on relforge1003 is CRITICAL: DISK CRITICAL - free space: / 7377 MB (10% inode=97%): /tmp 7377 MB (10% inode=97%): /var/tmp 7377 MB (10% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=relforge1003&var-datasource=eqiad+prometheus/ops
[18:09:47] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:12:47] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:16:37] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:21:15] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:00:27] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:13:11] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:39] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:29:13] (DiskSpace) firing: Disk space relforge1003:9100:/ 5.511% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:32:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[19:42:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:33:57] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sun 17 Dec 2023 03:07:37 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:35:27] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:36:45] (SystemdUnitFailed) firing: prometheus-dpkg-success-textfile.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:38:23] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-dpkg-success-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:58:44] (PS12) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[22:58:46] (PS9) Pppery: Run generate.php and arc liberate [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763)
[23:00:28] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:02:20] (PS10) Pppery: Run generate.php and arc liberate [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763)
[23:29:28] (DiskSpace) firing: Disk space relforge1003:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=relforge1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace