[00:42:44] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 1044 MB (5% inode=91%);
[02:12:49] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 1007 MB (5% inode=91%);
[03:42:52] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 971 MB (5% inode=91%);
[05:12:56] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 934 MB (5% inode=91%);
[06:26:44] RECOVERY - check disk on ORES-web02.Experimental is OK: DISK OK
[06:28:45] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/ORES
[06:54:11] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/ORES
[11:53:50] afk for a bit
[13:59:26] back
[14:01:30] o/ Amir1
[14:01:32] I'm in the call.
[14:02:01] on my way
[14:09:15] PROBLEM - ores grafana alert on icinga1001 is CRITICAL: CRITICAL: ORES ( https://grafana.wikimedia.org/d/000000255/ores ) is alerting: ORES CPU usage alert codfw. https://wikitech.wikimedia.org/wiki/ORES
[14:16:55] RECOVERY - ores grafana alert on icinga1001 is OK: OK: ORES ( https://grafana.wikimedia.org/d/000000255/ores ) is not alerting. https://wikitech.wikimedia.org/wiki/ORES
[14:25:51] PROBLEM - ores grafana alert on icinga1001 is CRITICAL: CRITICAL: ORES ( https://grafana.wikimedia.org/d/000000255/ores ) is alerting: Overload errors alert. https://wikitech.wikimedia.org/wiki/ORES
[14:26:21] akosiaris: It seems ores is overloaded in codfw. https://grafana.wikimedia.org/d/000000255/ores?refresh=1m&orgId=1
[14:36:50] Amir1: it's probably https://gerrit.wikimedia.org/r/#/c/operations/dns/+/497772/
[14:37:06] got any logs. I did not expect it to cause an issue
[14:37:08] ?
[14:37:31] akosiaris: no, graphs are going crazy
[14:38:07] akosiaris, looks like a spike in external requests.
[14:38:16] doubtful
[14:38:23] this time too way to my change
[14:38:29] way too well*
[14:38:37] I agree, but we do see a spike in external requests by all measures.
[14:39:04] https://grafana.wikimedia.org/d/000000255/ores?refresh=1m&panelId=1&fullscreen&orgId=1&from=now-2d&to=now-1m
[14:39:06] akosiaris, ^
[14:39:28] Maybe we were handling this spike in external requests *until* the redis change was deployed.
[14:40:00] looking at the request rate in ores2001, it doesn't seem elevated at all
[14:40:11] and it's almost entirely changeprop
[14:40:21] lemme revert the switchover of the redis just in case
[14:43:21] I'll force a worker restart just to make sure it was that
[14:52:38] ok sounds good akosiaris
[14:56:55] seems like overloads are dropping
[14:57:14] but why did this cause such an issue?
[14:57:20] it shouldn't have
[14:57:47] It looks like it didn't until the redis switch happened.
[14:58:39] hmm I think I now have an idea why
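For context on the switchover discussed above: ORES uses Redis both as its Celery task broker and as a score cache (the queue/cache distinction comes up again later in the log), so repointing the Redis endpoint touches in-flight task queues as well as cached scores. The sketch below is a minimal illustration of that kind of setup, not the actual ORES configuration; the hostnames, database numbers, and task body are hypothetical.

```python
# Minimal sketch of a Celery + Redis layout like the one discussed above.
# Hostnames, DB numbers, and the task body are hypothetical, for illustration only.
from celery import Celery
from redis import Redis

app = Celery(
    "ores_sketch",
    broker="redis://redis-main.example.org:6379/0",   # task queue (not persistent here)
    backend="redis://redis-main.example.org:6379/1",  # task results
)

# Separate logical database acting as a score cache.
score_cache = Redis(host="redis-main.example.org", port=6379, db=2)

@app.task
def score_revision(wiki, rev_id):
    """Placeholder scoring task; the real work happens in the ORES workers."""
    return {"wiki": wiki, "rev_id": rev_id, "score": None}
```

Because the broker queue lives in whichever Redis instance the workers connected to at startup, a DNS-level switchover does not move tasks that are already queued; that is consistent with reverting the switchover and forcing a worker restart being the quickest way back to a healthy state here.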
[15:00:08] btw, no way ores2001 is receiving the 68 external req/s https://grafana.wikimedia.org/d/000000255/ores?refresh=1m&panelId=1&fullscreen&orgId=1&from=1553090968430&to=1553093713632 says it does
[15:00:33] logs don't support that
[15:00:46] I'm not seeing that akosiaris
[15:00:53] That
[15:00:56] ores.ores1*.scores_request.count ?
[15:00:58] .count?
[15:01:00] that's wrong
[15:01:12] .count is the non-normalized version
[15:01:18] the normalized one is .rate
[15:01:20] I'm seeing ~250 reqs per minute
[15:01:33] RECOVERY - ores grafana alert on icinga1001 is OK: OK: ORES ( https://grafana.wikimedia.org/d/000000255/ores ) is not alerting. https://wikitech.wikimedia.org/wiki/ORES
[15:01:42] ah there is the scaleToSeconds though
[15:01:46] damn I hate graphite
[15:01:52] lol
[15:02:29] I see that we are returning precaching now.
[15:02:32] From codfw.
[15:02:36] It looks like we are recovering.
[15:03:24] Technical Advice IRC meeting starting in 60 minutes in channel #wikimedia-tech, hosts: @halfak & @CFisch_WMDE - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[15:03:41] we'll have to repeat this. We haven't finished the maintenance
[15:03:55] but I'll leave this for tomorrow
[15:04:13] all this vandalism fighting all these days has left me with minimal cognitive skills it seems
[15:05:09] akosiaris, do you want to start an incident report or should I?
[15:05:24] halfak: please do
[15:05:33] thanks
[15:05:34] OK. Will do.
[15:06:00] No problem. Thanks for your help getting this squared.
[15:06:09] heh, seems like I caused it
[15:06:16] at least I justified my t-shirt
[15:17:00] akosiaris, it looks like our queue length in codfw is not recovering.
[15:17:22] We're still returning some overload responses too.
[15:18:14] It's interesting that the queue length on 2002 is longer than 2001. Shouldn't celery be mirrored between 2002 and 2001?
[15:18:22] Or do we only mirror the score cache?
[15:21:02] the celery queue is not persistent, nor mirrored
[15:21:20] there was a task about that
[15:21:21] Gotcha. Makes sense. Just checking.
[15:21:42] I don't think we need the queue mirrored until we have an automated switchover strategy that won't make celery crazy.
[15:23:20] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190320-ORES
[15:26:43] 10ORES, 10Scoring-platform-team (Current), 10Wikimedia-Incident: ORES incident 20190320 documentation - https://phabricator.wikimedia.org/T218791 (10Halfak)
[15:28:01] 10ORES, 10Scoring-platform-team (Current), 10Wikimedia-Incident: ORES incident 20190320 documentation - https://phabricator.wikimedia.org/T218791 (10Halfak) a:03akosiaris Assigning to @akosiaris so that he can add details to the incident report. Please feel free to unassign when you are done. What foll...
[15:52:59] Technical Advice IRC meeting starting in 10 minutes in channel #wikimedia-tech, hosts: @halfak & @CFisch_WMDE - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[17:52:02] 10Scoring-platform-team (Current), 10Wikilabels, 10User-Ladsgroup: Wikilabels needs manual reboot when DB connection is broken - https://phabricator.wikimedia.org/T209604 (10Halfak) What does removing the explicit transaction or using query building have to do with this task? I'm confused.
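An aside on the .count vs .rate confusion at 15:00–15:01 above: statsd-style .count series in Graphite hold raw per-flush-interval totals, while .rate is already normalized per second, and scaleToSeconds() converts the former into the latter. The sketch below compares the two through Graphite's render API; the Graphite host is hypothetical, and the metric paths simply mirror the ones quoted in the channel.

```python
# Rough sketch of comparing a raw .count series against the normalized .rate
# series via Graphite's render API. The host below is hypothetical.
import requests

GRAPHITE = "https://graphite.example.org/render"

targets = [
    # already normalized to requests per second
    "sumSeries(ores.ores2*.scores_request.rate)",
    # raw per-interval totals, normalized on the fly
    "scaleToSeconds(sumSeries(ores.ores2*.scores_request.count), 1)",
]

resp = requests.get(
    GRAPHITE,
    params={"target": targets, "from": "-1h", "format": "json"},
    timeout=10,
)
resp.raise_for_status()

for series in resp.json():
    points = [v for v, _ in series["datapoints"] if v is not None]
    avg = sum(points) / len(points) if points else 0.0
    print(f"{series['target']}: ~{avg:.1f} req/s")
```

If the two targets agree, the dashboard panel was simply plotting the non-normalized series; if they diverge, the panel's scaling is worth a second look.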
[19:07:26] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 5687 bytes in 5.039 second response time https://wikitech.wikimedia.org/wiki/ORES
[19:07:38] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 6520 bytes in 5.047 second response time https://wikitech.wikimedia.org/wiki/ORES
[19:07:48] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 5687 bytes in 5.051 second response time https://wikitech.wikimedia.org/wiki/ORES
[19:07:49] I smell an API issue.
[19:07:55] MWAPI that is
[19:08:38] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/ORES
[19:08:52] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/ORES
[19:10:18] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.464 second response time https://wikitech.wikimedia.org/wiki/ORES
[19:10:43] Yup. Looks like it.
[19:11:02] Whenever labs and prod go down at the same time, we know it was probably an issue with something they have in common.
[22:12:02] Anyone recognise a "logstash_host: localhost" change to hieradata/role/common/ores.yaml?
[22:12:10] It's sat uncommitted on deployment-puppetmaster03
[22:17:12] halfak: ^^
[22:17:50] nvm, found it
[22:21:00] Krenair: sweet
[22:23:43] (see -releng)
[22:40:25] Krenair: i saw that now; it's hard to keep track of multi-channel conversations
[23:20:44] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 1081 MB (5% inode=91%);
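A closing note on the 19:07 blip and the observation at 19:11:02 above: both the labs and production ORES clusters fetch revision data from the MediaWiki action API, which is why they tend to fail together when that shared dependency hiccups. The sketch below illustrates that dependency with an illustrative retry policy; it is not the actual ORES extraction code, and the function name and retry parameters are made up for the example.

```python
# Sketch of the shared MediaWiki-API dependency: a transient API failure
# propagates to every ORES cluster that calls it, labs and production alike.
# Retry policy and function name are illustrative only.
import time
import requests

def fetch_revision(rev_id, retries=3, timeout=5):
    """Fetch revision content from the MediaWiki action API, with basic retries."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
    }
    for attempt in range(retries):
        try:
            resp = requests.get(
                "https://en.wikipedia.org/w/api.php", params=params, timeout=timeout
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # bubbles up to the caller, surfacing as a 500 from the web tier
            time.sleep(2 ** attempt)

if __name__ == "__main__":
    data = fetch_revision(123456)  # arbitrary example revision ID
    print(sorted(data.get("query", {})))
```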