[01:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:32:50] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:34:30] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:34:32] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:45:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:46:28] PROBLEM - Host an-worker1132 is DOWN: PING CRITICAL - Packet loss = 100% [01:47:04] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:40:24] (03CR) 10Stang: [C: 04-1] "Please run https://github.com/wikimedia/operations-mediawiki-config/blob/master/logos/manage.py to download images" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875914 (https://phabricator.wikimedia.org/T326067) (owner: 10Ghybu) [05:08:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:18:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:21:08] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230107T0800) [08:22:52] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 117 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [09:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:28:17] 10SRE, 10Data-Engineering: an-worker1132 down - https://phabricator.wikimedia.org/T326459 (10RhinosF1) [12:50:38] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 104 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:52:12] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [13:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:50:18] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:36] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:45:40] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:28] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:48:59] (03PS1) 10Stang: arwiki: Create extendedmover group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876310 (https://phabricator.wikimedia.org/T326434) [14:55:26] (03PS1) 10Stang: extwiki: Install SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876311 (https://phabricator.wikimedia.org/T326450) [16:31:50] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:32:02] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [17:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:28:56] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:56] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:46:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:51:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:55:28] 10SRE, 10DBA, 10MediaWiki-libs-Rdbms, 10MW-1.40-notes (1.40.0-wmf.18; 2023-01-09), 10Performance-Team (Radar): Check if setBigSelects() is still needed - https://phabricator.wikimedia.org/T325610 (10Ladsgroup) 05Open→03Resolved a:03Krinkle [19:27:37] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10phaultfinder) [20:44:10] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:45:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:48:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:52:36] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:53:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:07:22] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.251 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:07:48] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:08:20] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 20 Feb 2023 05:31:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:25:28] RECOVERY - Query Service HTTP Port on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 449 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:32:08] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.210 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:42:01] (03PS1) 10QChris: Add .gitreview [debs/cadvisor] - 10https://gerrit.wikimedia.org/r/876324 [22:42:04] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/cadvisor] - 10https://gerrit.wikimedia.org/r/876324 (owner: 10QChris) [23:10:47] (03PS6) 10Deni: Allow administrators to revoke autopatroller rights on sh.WP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/871272 (https://phabricator.wikimedia.org/T325938) (owner: 10Acamicamacaraca)