[00:00:54] (03CR) 10Cwhite: recommendation-api: update statsd configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/983403 (https://phabricator.wikimedia.org/T205870) (owner: 10Elukey) [00:05:49] (03CR) 10JHathaway: [C: 03+2] pki: fix rename of intermediates [puppet] - 10https://gerrit.wikimedia.org/r/983504 (https://phabricator.wikimedia.org/T282308) (owner: 10JHathaway) [00:08:52] (03CR) 10JHathaway: [C: 03+2] "thanks for the review moritz" [puppet] - 10https://gerrit.wikimedia.org/r/983462 (https://phabricator.wikimedia.org/T353467) (owner: 10JHathaway) [00:12:06] 10SRE, 10SRE-Access-Requests: Replace Kbrown's old ssh public key with a new one - https://phabricator.wikimedia.org/T353467 (10jhathaway) 05Open→03Resolved a:03jhathaway merged, enjoy! [00:18:26] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [00:36:28] (WidespreadPuppetFailure) resolved: (2) Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:38:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/983233 [00:38:37] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/983233 (owner: 10TrainBranchBot) [00:44:16] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@63804c4]: (no justification provided) [00:44:42] !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@63804c4]: (no justification provided) (duration: 00m 25s) [00:59:37] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/983233 (owner: 10TrainBranchBot) [01:07:34] 10SRE, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (10ANakanishi_WMF) @jhathaway I approve, thanks! [01:08:26] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [01:21:15] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@fb10de1]: (no justification provided) [01:21:26] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@fb10de1]: (no justification provided) (duration: 00m 10s) [01:30:11] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:32:49] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 templates with colons in filename made operations/puppet not cloneable on Windows - https://phabricator.wikimedia.org/T282308 (10matmarex) 05Open→03Resolved Thank you, it works again! My local copy was 22130 commits behind :) [02:36:52] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:08:26] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:28:26] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:16:22] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 templates with colons in filename made operations/puppet not cloneable on Windows - https://phabricator.wikimedia.org/T282308 (10jhathaway) Glad to hear @matmarex and apologies for the long wait, I will try to get the CI check added as well to avoid any regressions, l... [04:18:27] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [05:08:26] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [05:32:45] (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [05:52:45] (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [06:59:38] 10SRE, 10SRE-Access-Requests: Replace Kbrown's old ssh public key with a new one - https://phabricator.wikimedia.org/T353467 (10Nahid) Thank you. [07:28:27] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:33:29] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:39:19] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:59:15] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:00:27] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:02:05] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:02:09] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:54] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:11:05] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:56] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:18:21] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:18:27] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [08:21:39] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:24:09] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:33:27] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:36:53] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:43:27] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:45:11] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:50:59] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:53:27] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:56:53] (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:04:31] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:43] PROBLEM - cassandra-a service on restbase2028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:08:43] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:08:43] PROBLEM - Check systemd state on restbase2028 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:57] PROBLEM - cassandra-a CQL 10.192.16.237:9042 on restbase2028 is CRITICAL: connect to address 10.192.16.237 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [09:09:05] PROBLEM - cassandra-a SSL 10.192.16.237:7000 on restbase2028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [09:15:55] RECOVERY - cassandra-a service on restbase2028 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:16:03] RECOVERY - Check systemd state on restbase2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:10] (03PS1) 10FNegri: [toolsdb] Use jemalloc to prevent memory issues [puppet] - 10https://gerrit.wikimedia.org/r/983513 (https://phabricator.wikimedia.org/T353093) [09:16:27] RECOVERY - cassandra-a CQL 10.192.16.237:9042 on restbase2028 is OK: TCP OK - 0.075 second response time on 10.192.16.237 port 9042 https://phabricator.wikimedia.org/T93886 [09:16:35] RECOVERY - cassandra-a SSL 10.192.16.237:7000 on restbase2028 is OK: SSL OK - Certificate restbase2028-a valid until 2025-12-03 21:32:59 +0000 (expires in 718 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [09:16:41] (03CR) 10CI reject: [V: 04-1] [toolsdb] Use jemalloc to prevent memory issues [puppet] - 10https://gerrit.wikimedia.org/r/983513 (https://phabricator.wikimedia.org/T353093) (owner: 10FNegri) [09:20:00] (03PS2) 10FNegri: [toolsdb] Use jemalloc to prevent memory issues [puppet] - 10https://gerrit.wikimedia.org/r/983513 (https://phabricator.wikimedia.org/T353093) [09:21:29] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:24:55] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/939/con" [puppet] - 10https://gerrit.wikimedia.org/r/983513 (https://phabricator.wikimedia.org/T353093) (owner: 10FNegri) [09:26:09] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:27:29] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:50:11] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:00:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:03:39] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:17] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:49] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:24:21] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:28:27] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:18:27] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [12:45:39] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - No response from remote host 185.15.58.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:06:48] (03CR) 10Elukey: recommendation-api: update statsd configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/983403 (https://phabricator.wikimedia.org/T205870) (owner: 10Elukey) [13:08:27] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:29:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:39:23] (03PS3) 10Elukey: recommendation-api: update monitoring config [deployment-charts] - 10https://gerrit.wikimedia.org/r/983403 (https://phabricator.wikimedia.org/T205870) [13:39:25] (03PS3) 10Elukey: services: deploy the new rec-api-ng Docker image in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/983404 (https://phabricator.wikimedia.org/T349118) [13:40:04] (03CR) 10CI reject: [V: 04-1] recommendation-api: update monitoring config [deployment-charts] - 10https://gerrit.wikimedia.org/r/983403 (https://phabricator.wikimedia.org/T205870) (owner: 10Elukey) [13:40:06] (03CR) 10CI reject: [V: 04-1] services: deploy the new rec-api-ng Docker image in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/983404 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey) [13:44:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:53:19] (03PS4) 10Elukey: recommendation-api: update monitoring config [deployment-charts] - 10https://gerrit.wikimedia.org/r/983403 (https://phabricator.wikimedia.org/T205870) [13:53:21] (03PS4) 10Elukey: services: deploy the new rec-api-ng Docker image in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/983404 (https://phabricator.wikimedia.org/T349118) [13:56:34] (03CR) 10Elukey: recommendation-api: update monitoring config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/983403 (https://phabricator.wikimedia.org/T205870) (owner: 10Elukey) [14:36:53] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:56:53] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:28:27] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:16:56] (03PS1) 10Majavah: shared: lighttpd: fix override file path [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/983520 [16:18:01] (03PS2) 10Majavah: shared: lighttpd: fix override file path [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/983520 (https://phabricator.wikimedia.org/T293552) [16:18:27] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [16:25:21] (03PS1) 10Chlod Alejandro: Revert "util.main: Don't use mw.Map(), use a native Map() instead" [extensions/PageTriage] (wmf/1.42.0-wmf.9) - 10https://gerrit.wikimedia.org/r/983529 (https://phabricator.wikimedia.org/T353571) [17:08:27] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:51:56] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:09:53] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:53] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:53] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:18:27] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: (4) The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [20:34:13] (03CR) 10BryanDavis: [C: 03+2] "Nice find." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/983520 (https://phabricator.wikimedia.org/T293552) (owner: 10Majavah) [20:34:50] (03Merged) 10jenkins-bot: shared: lighttpd: fix override file path [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/983520 (https://phabricator.wikimedia.org/T293552) (owner: 10Majavah) [20:57:21] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:45] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:58:53] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:59:15] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:27] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:28:06] 10SRE-swift-storage, 10Commons: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons - https://phabricator.wikimedia.org/T353498 (10Aklapper) If this also happens with chunked uploader it is not an UploadWizard issue [21:53:15] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:58:17] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:58:53] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:17] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:31] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:00:25] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:55:09] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:02:27] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:33:27] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure