[00:09:15] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1154354 [00:09:15] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1154354 (owner: 10TrainBranchBot) [00:10:48] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:28:29] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1154354 (owner: 10TrainBranchBot) [00:44:14] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/dc3960d7e8fb7bb119a975e698e50b4f7d44a8ea723af9ba237ba6528e3cce82/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [00:57:37] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [01:04:14] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:48:39] (03CR) 10Bartosz Dziewoński: [C:03+1] "As I understand it, we're confident that this is fine to do, but we wanted to investigate T393963 first for academic reasons. I'm not sure" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) (owner: 10Gergő Tisza) [01:51:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153363 (https://phabricator.wikimedia.org/T395967) (owner: 10Gergő Tisza) [01:51:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153364 (https://phabricator.wikimedia.org/T394402) (owner: 10Gergő Tisza) [01:51:51] (03CR) 10Bartosz Dziewoński: [C:03+1] logging: Sample some high-volume log streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153364 (https://phabricator.wikimedia.org/T394402) (owner: 10Gergő Tisza) [01:52:01] (03CR) 10Bartosz Dziewoński: [C:03+1] logging: Allow sampling of Logstash logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153363 (https://phabricator.wikimedia.org/T395967) (owner: 10Gergő Tisza) [02:06:00] PROBLEM - Disk space on an-worker1093 is CRITICAL: DISK CRITICAL - free space: / 2018 MB (3% inode=93%): /tmp 2018 MB (3% inode=93%): /var/tmp 2018 MB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops [02:17:48] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:53:42] FIRING: [4x] JobUnavailable: Reduced 
availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:52:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [04:06:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:08:49] FIRING: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:16:30] FIRING: WebrequestSampledDown: Benthos metrics for webrequest_sampled are not reported from eqiad and codfw - https://wikitech.wikimedia.org/wiki/Benthos#Benthos_on_centrallog - https://grafana.wikimedia.org/d/V0TSK7O4z/benthos?var-port=4151 - https://alerts.wikimedia.org/?q=alertname%3DWebrequestSampledDown [04:16:30] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [04:16:41] FIRING: SLOMetricAbsent: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:16:45] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [04:16:56] FIRING: [2x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:17:37] RESOLVED: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [04:17:37] FIRING: [9x] SLOMetricAbsent: search-update-lag codfw - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:18:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:18:49] RESOLVED: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:18:51] !log restarted apache on phab1004 [04:18:52] 
FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [04:18:52] FIRING: [9x] SLOMetricAbsent: search-update-lag codfw - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:21:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:21:30] RESOLVED: WebrequestSampledDown: Benthos metrics for webrequest_sampled are not reported from eqiad and codfw - https://wikitech.wikimedia.org/wiki/Benthos#Benthos_on_centrallog - https://grafana.wikimedia.org/d/V0TSK7O4z/benthos?var-port=4151 - https://alerts.wikimedia.org/?q=alertname%3DWebrequestSampledDown [04:21:30] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [04:21:42] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [04:21:45] RESOLVED: [6x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:21:49] RESOLVED: [6x] SLOMetricAbsent: etcd-latency codfw - https://slo.wikimedia.org/?search=etcd-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:22:14] PROBLEM - Dell PowerEdge RAID / Supermicro Broadcom Controller on db2226 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [04:22:37] RESOLVED: [9x] SLOMetricAbsent: search-update-lag codfw - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:23:12] PROBLEM - Disk space on an-worker1110 is CRITICAL: DISK CRITICAL - free space: / 2092 MB (3% inode=95%): /tmp 2092 MB (3% inode=95%): /var/tmp 2092 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops [04:57:37] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:22:34] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:21:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:26:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:29:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:34:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:38:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:43:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:19:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:22:34] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 145 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:24:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:47:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:49:49] FIRING: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:50:46] I don't have my computer with me
[07:50:58] checking :)
[07:51:02] !incidents
[07:51:03] 6316 (UNACKED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad)
[07:51:03] 6315 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad)
[07:51:03] 6308 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org)
[07:51:04] 6309 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org)
[07:51:13] !ack 6316
[07:51:14] 6316 (ACKED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad)
[07:51:50] jelto: if you can write - anything weird/ongoing in the past days that I should be aware of?
[07:53:01] okok from httpd: AH00288: scoreboard is full, not at MaxRequestWorkers
[07:53:12] In #wikimedia-sre-collab Mutanten wrote about python-requests user agents which were doing some search requests.
[07:54:02] Maybe you can find something in superset and add it to requestctl? Phab was quiet for the last week except for yesterday's python-requests UAs
[07:55:03] jelto: ah wait I may have missed something, can I use requestctl now with Phab? No more custom list of IPs etc.?
[07:55:16] Unfortunately I don't have superset access on my phone
[07:55:40] Yes, you can use requestctl. There are already some actions, or you can create a new one
[07:57:06] okok
[08:00:12] is there a reason why we are using mpm-worker and not event? PHP issues?
[08:00:38] anyway, probably for later
[08:00:46] There are some spikes at 3:00 UTC and 7:00. Superset might give some insights at that time
[08:00:47] it's way easier in these cases to get the workers clogged like this
[08:01:30] yes yes I am going to check, but from https://grafana.wikimedia.org/d/000000587/phabricator?orgId=1&from=now-24h&to=now&timezone=utc&var-node=phab1004&viewPanel=panel-23 it seems that we have ~30 rps, which is not a lot
[08:02:42] Yes, but probably expensive search queries? These can consume quite some resources. Not sure if we should move to -security
[08:03:53] exactly, some requests have high ttfb
[08:03:56] lemme find them
[08:12:50] !log restart apache2 / php-fpm on phab1004
[08:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[08:57:37] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
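The 07:53-08:12 exchange (AH00288 "scoreboard is full", python-requests user agents, high-ttfb search requests, mpm-worker vs mpm-event) boils down to a few quick checks on the host. A minimal triage sketch follows, assuming a Debian-style Apache layout on phab1004, that mod_status is reachable from localhost, and that the access-log path and field positions below match the actual LogFormat; none of that is confirmed by the log above, so verify before use:

  # Quick triage sketch for "scoreboard is full, not at MaxRequestWorkers" on phab1004.
  # Assumed: Debian Apache layout, mod_status on localhost, access log path and
  # field order below (request duration %D in microseconds as the last field).

  # Which MPM is actually loaded (worker vs event)?
  apache2ctl -M 2>/dev/null | grep -i mpm

  # Scoreboard occupancy right now: busy vs idle slots.
  curl -s 'http://localhost/server-status?auto' | grep -E '^(BusyWorkers|IdleWorkers|Scoreboard)'

  LOG=/var/log/apache2/other_vhosts_access.log   # assumed path

  # Client IPs sending python-requests traffic, busiest first
  # (field 2 is the client IP in the default vhost_combined format).
  grep -i 'python-requests' "$LOG" | awk '{print $2}' | sort | uniq -c | sort -rn | head -20

  # Slowest requests (duration assumed to be the last field), to spot expensive searches.
  awk '{print $NF, $0}' "$LOG" | sort -rn | head -20 | cut -c1-200

Whatever user-agent or path pattern falls out of this is what the requestctl action discussed at 07:55 would match. The mpm question at 08:00 mostly hinges on how PHP is run: plain mod_php effectively forces a non-threaded MPM, while php-fpm (which the 08:12 restart suggests is in use here) works fine under mpm_event.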
[08:59:49] FIRING: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:00:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:04:49] RESOLVED: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:12:27] !incidents
[09:12:28] 6316 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad)
[09:12:28] 6315 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad)
[09:12:28] 6308 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org)
[09:12:28] 6309 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org)
[09:13:58] (03PS1) 10Bunnypranav: core-Permissions:Restrict editing on cawikimedia to autoconfirmed only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154369 (https://phabricator.wikimedia.org/T396178)
[10:54:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:57:49] FIRING: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:58:50] I'm still not at my computer :(
[10:59:00] But there is some background in -security
[11:01:27] anything I can do?
[11:01:49] See -security [11:03:08] !incidents [11:03:08] 6317 (UNACKED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:03:08] 6316 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:03:08] 6315 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:03:09] 6308 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [11:03:09] 6309 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [11:03:13] !ack 6317 [11:03:13] 6317 (ACKED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:04:07] !incidents [11:04:08] 6317 (ACKED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:04:08] 6316 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:04:08] 6315 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:04:08] 6308 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [11:04:09] 6309 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [11:07:49] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye [11:07:49] RESOLVED: [3x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:08:01] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10893048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host thanos-be2006.codfw.wmnet with OS bullse... [11:08:36] !incidents [11:08:36] 6317 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:08:37] 6316 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:08:37] 6315 (RESOLVED) [3x] ProbeDown sre (phab1004:443 probes/custom eqiad) [11:08:37] 6308 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [11:08:37] 6309 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [11:09:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye [11:43:52] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10893096 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host thanos-be2007.codfw.wmnet with OS bullse... 
[12:22:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [12:57:37] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [13:11:34] PROBLEM - Hadoop NodeManager on an-worker1205 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:19:42] PROBLEM - Hadoop NodeManager on an-worker1203 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:32:42] RECOVERY - Hadoop NodeManager on an-worker1203 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:36:34] RECOVERY - Hadoop NodeManager on an-worker1205 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:49:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:54:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:46:42] (03PS1) 10Andrew Bogott: Put cloudcontrol2010-dev into service [puppet] - 10https://gerrit.wikimedia.org/r/1154373 (https://phabricator.wikimedia.org/T396064) [14:47:28] (03CR) 10Andrew Bogott: [C:03+2] Put cloudcontrol2010-dev into service [puppet] - 10https://gerrit.wikimedia.org/r/1154373 (https://phabricator.wikimedia.org/T396064) (owner: 10Andrew Bogott) [14:50:48] (03PS1) 10Andrew Bogott: Put cloudcontrol2010-dev into service, followup [puppet] - 10https://gerrit.wikimedia.org/r/1154374 (https://phabricator.wikimedia.org/T396064) [14:55:14] (03CR) 10Andrew Bogott: [C:03+2] Put cloudcontrol2010-dev into service, followup [puppet] - 10https://gerrit.wikimedia.org/r/1154374 (https://phabricator.wikimedia.org/T396064) (owner: 10Andrew Bogott) [14:55:16] PROBLEM - Disk space on an-worker1105 is CRITICAL: DISK CRITICAL - free space: / 2060 MB (3% inode=95%): /tmp 2060 MB (3% inode=95%): /var/tmp 2060 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1105&var-datasource=eqiad+prometheus/ops [15:02:50] (03PS1) 10Andrew Bogott: Add cloudcontrol2010-dev to cloudcontrol list in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1154375 (https://phabricator.wikimedia.org/T396064) [15:03:37] (03CR) 10Andrew Bogott: [C:03+2] Add cloudcontrol2010-dev to 
cloudcontrol list in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1154375 (https://phabricator.wikimedia.org/T396064) (owner: 10Andrew Bogott) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:16] (03PS1) 10Andrew Bogott: Revert "Put cloudcontrol2010-dev into service" [puppet] - 10https://gerrit.wikimedia.org/r/1154376 (https://phabricator.wikimedia.org/T396064) [15:14:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:23:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:25:52] (03PS2) 10Andrew Bogott: Revert "Put cloudcontrol2010-dev into service" [puppet] - 10https://gerrit.wikimedia.org/r/1154376 (https://phabricator.wikimedia.org/T396064) [15:27:30] (03CR) 10Andrew Bogott: [C:03+2] Revert "Put cloudcontrol2010-dev into service" [puppet] - 10https://gerrit.wikimedia.org/r/1154376 (https://phabricator.wikimedia.org/T396064) (owner: 10Andrew Bogott) [15:28:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:55:16] PROBLEM - Disk space on an-worker1105 is CRITICAL: DISK CRITICAL - free space: / 2063 MB (3% inode=95%): /tmp 2063 MB (3% inode=95%): /var/tmp 2063 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1105&var-datasource=eqiad+prometheus/ops [16:06:00] PROBLEM - Disk space on an-worker1093 is CRITICAL: DISK CRITICAL - free space: / 1925 MB (3% inode=93%): /tmp 1925 MB (3% inode=93%): /var/tmp 1925 MB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops [16:22:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [16:57:37] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [17:29:19] (03Abandoned) 10Lucas Werkmeister: beta cluster: Disable $wgOATHRequiredForGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153674 (https://phabricator.wikimedia.org/T396061) (owner: 10Lucas Werkmeister) [17:49:16] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:49:26] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:51:24] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:52:20] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:52:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54084 bytes in 9.603 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:52:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, ... [17:52:51] via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [17:53:06] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:57:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [18:02:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, ... [18:02:51] via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [18:03:13] anybody handling this? [18:06:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, ... 
[18:06:51] via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation
[18:08:40] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:09:16] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:09:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:09:44] federico3: not that I'm aware of. The high level steps are at https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_80%
[18:12:06] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:14:27] sobanski: I haven't been in any oncall shift yet, but is there anything I can do to help? Want me to follow the runbook?
[18:16:37] Most certainly :)
[18:17:19] I can see link saturation on cr2-codfw on grafana
[18:17:49] xe-0/1/1/0 peering with DE-CIX
[18:22:23] sobanski: it looks like a 10Gbps peering with DE-CIX in Germany, so perhaps it's outbound traffic (not cross-DC transfer), how can I check with webrequest logs?
[18:22:58] federico3: see -security
[18:31:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, ...
[18:31:51] via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation
[18:35:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, ...
[18:35:51] via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation
[18:45:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, ...
[18:45:51] via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation
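For the 18:22 question (how to tie the saturated cr2-codfw peering port back to concrete traffic using webrequest logs): besides Superset/Turnilo, the sampled webrequest feed referenced by the earlier Benthos/centrallog alerts can give a rough first answer. This is a sketch only; the file path, the sampling rate, and the JSON field names are all assumptions to check on the centrallog host:

  # Rough sketch: estimate which traffic is filling the cr2-codfw peering port
  # from the sampled webrequest feed.
  # Assumed and unverified: SAMPLED path, RATE, and the field names
  # user_agent / uri_host / response_size.

  SAMPLED=/srv/log/webrequest/sampled.json   # assumed path on the centrallog host
  RATE=1000                                  # assumed 1:RATE sampling ratio
  WINDOW=200000                              # last N sampled records ~ recent traffic

  # Top (user_agent, uri_host) pairs by estimated bytes sent.
  tail -n "$WINDOW" "$SAMPLED" \
    | jq -r '[(.user_agent // "-"), (.uri_host // "-"), (.response_size // 0)] | @tsv' \
    | awk -F'\t' -v rate="$RATE" '{b[$1"\t"$2]+=$3} END {for (k in b) printf "%15.0f\t%s\n", b[k]*rate, k}' \
    | sort -rn | head -20

A saturated 10 Gbps port is roughly 1.25 GB/s, so the scaled byte counts only need to be right to an order of magnitude; whichever user-agent/host combination dominates is the natural starting point for the -security discussion or a requestctl rule, as in the morning's Phabricator case.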
[18:49:44] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[19:04:44] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[19:08:40] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:09:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:22:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[20:57:37] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[21:35:16] PROBLEM - Disk space on an-worker1105 is CRITICAL: DISK CRITICAL - free space: / 2096 MB (3% inode=95%): /tmp 2096 MB (3% inode=95%): /var/tmp 2096 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1105&var-datasource=eqiad+prometheus/ops
[22:10:32] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Mon 23 Jun 2025 10:10:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[23:38:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1154386
[23:38:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1154386 (owner: 10TrainBranchBot)
[23:49:25] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1154386 (owner: 10TrainBranchBot)