[00:01:41] PROBLEM - AQS root url on aqs1005 is CRITICAL: connect to address 10.64.32.138 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[00:04:24] PROBLEM - AQS root url on aqs1009 is CRITICAL: connect to address 10.64.48.119 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[00:04:49] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[00:04:53] PROBLEM - AQS root url on aqs1006 is CRITICAL: connect to address 10.64.48.146 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[00:06:06] nuria: are you online?
[00:06:19] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[00:46:29] RECOVERY - AQS root url on aqs1005 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[00:53:15] RECOVERY - AQS root url on aqs1006 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[00:54:51] RECOVERY - AQS root url on aqs1009 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[08:53:09] I sent an email about the AQS failures
[08:53:59] it was due to a networking problem, but it seems that the service should have recovered sooner (the outage ended but SRE had to manually restart some aqs daemons even afterwards to force the recovery)
[08:54:11] seems related to application caching of DNS resolution failures
[08:54:17] anyway, all good now
[08:55:18] I guess that all those cassandra-daily-coord-local_group_default_T_mediarequest_per_file errors are false positives, right?
[08:55:48] fdans: o/
[08:55:51] are you online?
[08:55:57] just seen your email
[08:56:05] elukey: hellooo
[08:56:45] hello :)
[08:56:53] elukey: yeah, I think maybe we don't need SLA alarms when we're backfilling?
[08:57:37] there should be a way to easily turn off/on those alarms, it is really a pain with oozie currently
[08:57:56] anyway, I think that Nuria suspended the job to avoid all that spam to alerts
[08:58:11] (there has been an outage for AQS in the middle of that burst)
[08:58:31] yeah I was looking at it when it happened
[08:58:43] but the last SLA alert arrived at around 2 AM my time
[08:58:54] and the coord was suspended at ~6 something
[08:58:57] is it ok now?
[08:59:05] if so we can un-suspend probably
[08:59:15] elukey: yeah I'm going to unsuspend
[08:59:32] please !log in here too
[08:59:57] do you need anything? Sounds like everything is under control
[09:00:05] !log resumed per file mediarequests backfilling coordinator
[09:00:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:00:13] maybe let's check in a couple of hours if the alerts are piling up or not
[09:00:27] elukey: no all good, I panicked a little when it happened because I thought my job was causing the aqs outage
[09:00:48] but then I saw the activity in the ops channel
[09:00:58] thank you elukey, have a nice saturday :)
[09:01:24] super, you too!
[09:01:50] it was really unfortunate that an independent AQS outage happened, I would have been worried too :)
[23:06:37] Analytics, Analytics-Kanban, Cloud-Services, Developer-Advocacy (Jul-Sep 2019): add whether an edit happened on cloud VPS to geoeditors-daily dataset - https://phabricator.wikimedia.org/T233504 (Nuria)
[23:06:49] Analytics, Analytics-Kanban, Cloud-Services, Developer-Advocacy (Jul-Sep 2019): add whether an edit happened on cloud VPS to geoeditors-daily dataset - https://phabricator.wikimedia.org/T233504 (Nuria) We have a UDF that @bd808 worked on a while back that can classify IPs as coming from cloud: http...
[23:08:21] Analytics, Analytics-Kanban, Cloud-Services, Developer-Advocacy (Jul-Sep 2019): add whether an edit happened on cloud VPS to geoeditors-daily dataset - https://phabricator.wikimedia.org/T233504 (Nuria) Can someone from #cloud-services-team confirm we want the dashboard with this data to be public...
[23:38:22] Analytics, Analytics-Kanban, Cloud-Services, Developer-Advocacy (Jul-Sep 2019): add whether an edit happened on cloud VPS to geoeditors-daily dataset - https://phabricator.wikimedia.org/T233504 (Nuria) a:JAllemandou