[00:02:17] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:06:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [00:07:53] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1185294 [00:07:53] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1185294 (owner: TrainBranchBot) [00:14:38] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.680 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:28:08] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1185294 (owner: TrainBranchBot) [00:31:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:36:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 2.886 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:00:39] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [01:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:09:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:11:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:12:24] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 44s) [01:16:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:21:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.464 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [01:21:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:24:36] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 1.379 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:26:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [01:26:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 3.788 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:32:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:32:55] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:33:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.812 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [01:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:34:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [01:39:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.508 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [01:44:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting 
[01:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:49:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:52:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.364 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [01:54:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:08:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [02:19:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.436 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [02:20:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [02:29:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.752 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [02:31:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [02:42:08] PROBLEM - 
graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.472 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [02:43:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [02:46:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.233 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [02:49:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [02:51:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:52:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.262 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [02:54:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [02:54:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:57:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.228 second response time 
https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [02:59:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.389 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:03:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [03:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:09:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.472 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [03:11:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [03:12:27] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [03:14:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:16:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:18:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.447 second response time 
https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [03:19:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [03:24:38] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.314 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:28:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.468 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [03:29:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [03:36:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.248 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [03:39:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [03:39:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:49:38] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:51:30] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 4.144 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:02:32] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes 
(http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:07:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.254 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [04:07:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:12:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [04:14:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [04:18:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.286 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [04:19:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [04:19:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:24:36] RECOVERY - mailman 
list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 2.430 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:26:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.233 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [04:29:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [04:33:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.287 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [04:40:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [04:43:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.278 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [04:48:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [04:49:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:51:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:54:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.296 
second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [04:55:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [04:56:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:58:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.300 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [05:00:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [05:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:04:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:08:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.431 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [05:08:57] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [05:22:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:23:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.334 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [05:25:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [05:28:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.414 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [05:30:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [05:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:33:57] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:37:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.366 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [05:44:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [05:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:09:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.450 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [06:10:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [06:23:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.582 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [06:24:04] RECOVERY 
- graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [06:27:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.237 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [06:33:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [06:38:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.246 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [06:46:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [06:50:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.319 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [06:54:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [06:58:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.362 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [07:00:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time 
https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250907T0700) [07:08:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.269 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [07:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:12:27] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [07:15:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [07:28:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.426 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [07:29:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [07:32:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.351 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [07:33:04] RECOVERY - graphite.wikimedia.org api on 
graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [07:38:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.327 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [07:40:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [08:01:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.346 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [08:02:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [08:02:32] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:55:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.295 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [08:57:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [09:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state 
failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:06:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:08:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.655 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [09:11:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [09:11:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:22:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:29:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.329 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [09:32:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [09:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:51:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.329 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [09:52:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [10:04:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:07:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.461 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [10:08:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [10:09:36] RECOVERY - mailman list info on 
lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 1.496 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:14:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.365 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [10:16:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [10:45:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.566 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [10:46:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [11:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:12:27] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [11:16:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:21:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.147 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:27:17] RESOLVED: [2x] ProbeDown: Service wdqs2012:443 has 
failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:29:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:34:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.285 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [11:34:36] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 2.348 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:37:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [11:42:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.652 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [11:43:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [11:56:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.201 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [11:57:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.013 second response time 
https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [12:16:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:31:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54828 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:39:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.317 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [12:39:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:40:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [12:44:38] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.512 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:46:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:51:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 8.844 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:54:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.477 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [12:56:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.016 second response time 
https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [12:59:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:04:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:14:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.104 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [13:15:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [13:22:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:29:08] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.325 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [13:30:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.014 second response time 
https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [13:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:08:57] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:12:27] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [15:24:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:28:57] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:33:57] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:48:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.280 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [15:50:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [17:04:05] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:22:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:29:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:49:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.265 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:09:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.308 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [18:10:49] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [18:11:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 
200 OK - 311 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [18:55:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [19:01:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:06:36] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 9.761 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:12:27] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [19:21:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:31:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 9.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:01:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:16:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:04:05] FIRING: 
HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:04:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:19:42] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.985 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:22:06] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:33:57] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:48:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:41:13] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [22:45:18] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1051 - vriley@cumin1003" [22:45:50] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt es1051 - vriley@cumin1003" [22:45:50] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:46:18] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es1051 [22:47:32] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1051 [22:48:06] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host es1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:04:06] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.340 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [23:05:04] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [23:08:57] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:10:26] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:12:28] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki 
swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [23:33:07] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host es1051.eqiad.wmnet with OS bookworm [23:33:16] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11155905 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host es1051.eqiad.wmnet with OS bookworm [23:38:07] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1185433 [23:38:07] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1185433 (owner: TrainBranchBot) [23:51:36] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1185433 (owner: TrainBranchBot) [23:59:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring