[00:50:46] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 1044 MB (5% inode=91%);
[02:20:49] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 1008 MB (5% inode=91%);
[03:50:53] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 972 MB (5% inode=91%);
[05:20:57] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 935 MB (5% inode=91%);
[06:26:44] RECOVERY - check disk on ORES-web02.Experimental is OK: DISK OK
[09:39:52] ORES, Scoring-platform-team (Current), Wikimedia-Incident: ORES incident 20190320 documentation - https://phabricator.wikimedia.org/T218791 (akosiaris) Open→Resolved Incident report updated. The only actionable followup task is T122676
[09:48:44] Amir1: So the celery task queue worked
[09:48:50] look at https://grafana.wikimedia.org/d/000000255/ores?refresh=1m&panelId=61&fullscreen&orgId=1&from=now-24h&to=now-1m
[09:49:18] it's the result of the outage yesterday but you get the idea
[09:49:31] I'll clean it up, it's useless right now
[09:54:17] !log del celery on oresrdb2002, leftovers from yesterday's incident
[09:54:18] akosiaris: Not expecting to hear !log here
[09:54:18] Sorry, you are not authorized to perform this
[09:54:18] No log is open in #wikimedia-ai - log on to open it, log list to list the available logs.
[09:54:24] ahahahahaha
[09:54:44] talking about AI...
[12:16:33] o/
[12:16:47] akosiaris: thank you so much
[12:38:56] akosiaris: for when you have time: https://gerrit.wikimedia.org/r/c/operations/puppet/+/497316
[13:04:02] heads up, 2nd try for https://gerrit.wikimedia.org/r/#/c/operations/dns/+/498067/
[13:24:00] we got some very minor elevated errors in codfw (2.7/min) but currently under control
[13:24:47] I am not touching anything btw on purpose, I am leaving the infrastructure to drain from the primary to the secondary redis database on its own
[13:30:38] and it's becoming clear that while fine, it's going to be a long-tail distribution, as the last uwsgi/celery workers will have a very small chance to receive traffic and thus reach their restart threshold.
[13:30:57] ok, moving on with a rolling restart then
[13:34:30] !log restart all uwsgi and celery on ores2*
[13:34:30] akosiaris: Not expecting to hear !log here
[13:34:30] No log is open in #wikimedia-ai - log on to open it, log list to list the available logs.
[13:34:43] what? no wm-bot4 this time around?
[13:42:16] akosiaris: hmm
[13:42:50] akosiaris: wm-bot4 is here
[13:43:10] akosiaris: also !log in -cloud
[13:43:12] :)
[13:45:48] ok, 1 brief overload incident and we are ok
[13:45:53] I'll reboot oresrdb2001 now
[13:55:53] PROBLEM - ores grafana alert on icinga1001 is CRITICAL: CRITICAL: ORES ( https://grafana.wikimedia.org/d/000000255/ores ) is alerting: ORES CPU usage alert codfw. https://wikitech.wikimedia.org/wiki/ORES
[13:56:57] RECOVERY - ores grafana alert on icinga1001 is OK: OK: ORES ( https://grafana.wikimedia.org/d/000000255/ores ) is not alerting. https://wikitech.wikimedia.org/wiki/ORES
[13:57:04] * halfak looks on
[13:57:20] CPU ?
[13:57:23] which one?
[13:57:47] oresrdb2001
[13:58:31] ah that graph is counting wrong
[13:58:32] Scoring-platform-team, CommRel-Specialists-Support, User-Johan: Community Relations Specialist support for Scoring Platform - https://phabricator.wikimedia.org/T217232 (Johan) Met with @Harej.
We talked about what we could do generally and he should come back with a more specific request.
[13:58:37] 100 - (avg by (instance) (irate(node_cpu{mode="idle", instance=~"ores.+"}[5m])) * 100)
[13:58:44] so what happens when the host is being rebooted?
[13:58:58] and does not send metrics?
[13:59:00] 100% :P
[13:59:15] Oh god. Why is that equation like that.
[13:59:50] it's trying to discount idle cpu time
[14:00:23] That's an odd strategy
[14:00:28] * halfak hops into a meeting.
[14:00:32] yeah it looks weird to me too
[14:00:36] having a closer look now
[14:23:53] PROBLEM - ores grafana alert on icinga1001 is CRITICAL: CRITICAL: ORES ( https://grafana.wikimedia.org/d/000000255/ores ) is alerting: Overload errors alert. https://wikitech.wikimedia.org/wiki/ORES
[14:24:26] that seems to be a valid one.... the switchback is not going as smoothly
[14:24:29] looking
[14:25:45] we are not erroring out btw, just backlogged
[14:26:23] !log restart ores-uwsgi and ores-celery-worker on ores2*
[14:26:23] akosiaris: Not expecting to hear !log here
[14:26:23] No log is open in #wikimedia-ai - log on to open it, log list to list the available logs.
[14:28:58] * hare throws peanuts at the bots for his amusement
[14:29:14] lol
[14:30:02] out of the woods per grafana now
[14:33:42] akosiaris: None of your logs are going through since logging is not enabled here; you have to !log in -cloud
[14:34:01] RECOVERY - ores grafana alert on icinga1001 is OK: OK: ORES ( https://grafana.wikimedia.org/d/000000255/ores ) is not alerting. https://wikitech.wikimedia.org/wiki/ORES
[14:34:28] akosiaris, what are you working on anyway?
[14:36:22] Zppix: I know. I am not actually trying to log via the bots. Just leaving them in the backlog so people know
[14:36:29] halfak: kernel upgrades for oresrdb hosts
[14:36:31] Ah
[14:36:32] I am done
[14:36:39] Aha. Gotcha.
[14:36:58] I also gained some insight into how the system responds to redis database failovers
[14:37:07] * halfak looks for better ways to get CPU%
[14:37:13] that should come in handy for https://phabricator.wikimedia.org/T122676
[14:37:26] halfak: ok I am fixing that already, no worries
[14:37:41] it's actually easy
[14:37:47] Gotcha. Cool thank you.
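A minimal sketch of the kind of guard discussed above, assuming the dashboard's Prometheus data source also exposes the standard up metric for the same ores instances (the actual change applied to the panel is not shown in this log): restricting the expression to targets with up == 1 keeps a host that stops reporting metrics during a reboot from being counted as 100% CPU.

    # hedged sketch, not the applied fix: only evaluate instances that are currently reporting metrics
    (100 - (avg by (instance) (irate(node_cpu{mode="idle", instance=~"ores.+"}[5m])) * 100))
      and on (instance) up{instance=~"ores.+"} == 1

The and on (instance) set match simply drops series for targets Prometheus currently marks as down, so a rebooting oresrdb host produces no data point instead of a false 100%.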
[14:41:23] I'll probably do one more round on Tuesday. I want to collect some more data. It should help answer some questions for automatic failover
[16:00:02] Scoring-platform-team, revscoring, artificial-intelligence: Develop manual testing strategy for bias detection - https://phabricator.wikimedia.org/T117425 (awight)
[16:01:45] Amir1, hare: running a couple minutes late sorry!
[16:20:17] afk for a bit
[16:39:06] back now
[17:24:06] wikimedia/editquality#528 (modular_template - 85a418a : Aaron Halfaker): The build was fixed. https://travis-ci.org/wikimedia/editquality/builds/509544427
[17:33:41] Scoring-platform-team, Data-Services, Wikilabels, cloud-services-team (Kanban): postgresql on clouddb1002 needs some kind of puppet management of pg_hba.conf - https://phabricator.wikimedia.org/T209396 (Bstorm)
[17:33:53] Scoring-platform-team, Data-Services, Wikilabels, cloud-services-team (Kanban): postgresql on clouddb1002 needs some kind of puppet management of pg_hba.conf - https://phabricator.wikimedia.org/T209396 (Bstorm) p: Normal→Low
[17:37:56] I'm done for the day
[17:38:02] o/
[21:16:04] Scoring-platform-team, artificial-intelligence: Implement NSFW image classifier using Open NSFW - https://phabricator.wikimedia.org/T214201 (MusikAnimal) The saga continues.
@Halfak and other kind, smart people... I'm pleading that we get this show on the road. I have set up the open_nsfw-- app on VPS a...
[22:08:08] woops. Forgot to rename myself when I went back to meetings.
[23:08:12] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 1081 MB (5% inode=91%);