[00:42:44] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 1044 MB (5% inode=91%);
[02:12:49] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 1007 MB (5% inode=91%);
[03:42:52] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 971 MB (5% inode=91%);
[05:12:56] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 934 MB (5% inode=91%);
[06:26:44] RECOVERY - check disk on ORES-web02.Experimental is OK: DISK OK
[06:28:45] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/ORES
[06:54:11] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/ORES
[11:53:50] afk for a bit
[13:59:26] back
[14:01:30] o/ Amir1
[14:01:32] I'm in the call.
[14:02:01] on my way
[14:09:15] PROBLEM - ores grafana alert on icinga1001 is CRITICAL: CRITICAL: ORES ( https://grafana.wikimedia.org/d/000000255/ores ) is alerting: ORES CPU usage alert codfw. https://wikitech.wikimedia.org/wiki/ORES
[14:16:55] RECOVERY - ores grafana alert on icinga1001 is OK: OK: ORES ( https://grafana.wikimedia.org/d/000000255/ores ) is not alerting. https://wikitech.wikimedia.org/wiki/ORES
[14:25:51] PROBLEM - ores grafana alert on icinga1001 is CRITICAL: CRITICAL: ORES ( https://grafana.wikimedia.org/d/000000255/ores ) is alerting: Overload errors alert. https://wikitech.wikimedia.org/wiki/ORES
[14:26:21] akosiaris: It seems ores is overloaded in codfw. https://grafana.wikimedia.org/d/000000255/ores?refresh=1m&orgId=1
[14:36:50] Amir1: it's probably https://gerrit.wikimedia.org/r/#/c/operations/dns/+/497772/
[14:37:06] got any logs. I did not expect it to cause an issue
[14:37:08] ?
[14:37:31] akosiaris: no, graphs are going crazy
[14:38:07] akosiaris, looks like a spike in external requests.
[14:38:16] doubtful
[14:38:23] this time too way to my change
[14:38:29] way too well*
[14:38:37] I agree, but we do see a spike in external requests by all measures.
[14:39:04] https://grafana.wikimedia.org/d/000000255/ores?refresh=1m&panelId=1&fullscreen&orgId=1&from=now-2d&to=now-1m
[14:39:06] akosiaris, ^
[14:39:28] Maybe we were handling this spike in external requests *until* the redis change was deployed.
[14:40:00] looking at the request rate in ores2001, it doesn't seem elevated at all
[14:40:11] and it's almost entirely changeprop
[14:40:21] lemme revert the switchover of the redis just in case
[14:43:21] I'll force a worker restart just to make sure it was that
[14:52:38] ok sounds good akosiaris
[14:56:55] seems like overloads are dropping
[14:57:14] but why did this cause such an issue?
[14:57:20] it shouldn't have
[14:57:47] It looks like it didn't until the redis switch happened.
[14:58:39] hmm I think I now have an idea why
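For context on the switchover discussed above: ORES uses Redis both as its Celery task broker and as a score cache (the queue/cache distinction comes up again later in the log), so repointing the Redis endpoint touches in-flight task queues as well as cached scores. The sketch below is a minimal illustration of that kind of setup, not the actual ORES configuration; the hostnames, database numbers, and task body are hypothetical.

```python
# Minimal sketch of a Celery + Redis layout like the one discussed above.
# Hostnames, DB numbers, and the task body are hypothetical, for illustration only.
from celery import Celery
from redis import Redis

app = Celery(
    "ores_sketch",
    broker="redis://redis-main.example.org:6379/0",   # task queue (not persistent here)
    backend="redis://redis-main.example.org:6379/1",  # task results
)

# Separate logical database acting as a score cache.
score_cache = Redis(host="redis-main.example.org", port=6379, db=2)

@app.task
def score_revision(wiki, rev_id):
    """Placeholder scoring task; the real work happens in the ORES workers."""
    return {"wiki": wiki, "rev_id": rev_id, "score": None}
```

Because the broker queue lives in whichever Redis instance the workers connected to at startup, a DNS-level switchover does not move tasks that are already queued; that is consistent with reverting the switchover and forcing a worker restart being the quickest way back to a healthy state here.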
[15:00:08] btw, no way ores2001 is receiving the 68 external req/s https://grafana.wikimedia.org/d/000000255/ores?refresh=1m&panelId=1&fullscreen&orgId=1&from=1553090968430&to=1553093713632 says it does
[15:00:33] logs don't support that
[15:00:46] I'm not seeing that akosiaris
[15:00:53] That
[15:00:56] ores.ores1*.scores_request.count ?
[15:00:58] .count?
[15:01:00] that's wrong
[15:01:12] .count is the non-normalized version
[15:01:18] the normalized one is .rate
[15:01:20] I'm seeing ~250 reqs per minute
[15:01:33] RECOVERY - ores grafana alert on icinga1001 is OK: OK: ORES ( https://grafana.wikimedia.org/d/000000255/ores ) is not alerting. https://wikitech.wikimedia.org/wiki/ORES
[15:01:42] ah there is the scaleToSeconds though
[15:01:46] damn I hate graphite
[15:01:52] lol
[15:02:29] I see that we are returning precaching now.
[15:02:32] From codfw.
[15:02:36] It looks like we are recovering.
[15:03:24] Technical Advice IRC meeting starting in 60 minutes in channel #wikimedia-tech, hosts: @halfak & @CFisch_WMDE - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[15:03:41] we'll have to repeat this. We haven't finished the maintenance
[15:03:55] but I'll leave this for tomorrow
[15:04:13] all this vandalism fighting all these days has left me with minimal cognitive skills it seems
[15:05:09] akosiaris, do you want to start an incident report or should I?
[15:05:24] halfak: please do
[15:05:33] thanks
[15:05:34] OK. Will do.
[15:06:00] No problem. Thanks for your help getting this squared.
[15:06:09] heh, seems like I caused it
[15:06:16] at least I justified my t-shirt
[15:17:00] akosiaris, it looks like our queue length in codfw is not recovering.
[15:17:22] We're still returning some overload responses too.
[15:18:14] It's interesting that the queue length on 2002 is longer than 2001. Shouldn't celery be mirrored between 2002 and 2001?
[15:18:22] Or do we only mirror the score cache?
[15:21:02] the celery queue is not persistent, nor mirrored
[15:21:20] there was a task about that
[15:21:21] Gotcha. Makes sense. Just checking.
[15:21:42] I don't think we need the queue mirrored until we have an automated switchover strategy that won't make celery crazy.
[15:23:20] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190320-ORES
[15:26:43] 10ORES, 10Scoring-platform-team (Current), 10Wikimedia-Incident: ORES incident 20190320 documentation - https://phabricator.wikimedia.org/T218791 (10Halfak)
[15:28:01] 10ORES, 10Scoring-platform-team (Current), 10Wikimedia-Incident: ORES incident 20190320 documentation - https://phabricator.wikimedia.org/T218791 (10Halfak) a:03akosiaris Assigning to @akosiaris so that he can add details to the incident report. Please feel free to unassign when you are done. What foll...
[15:52:59] Technical Advice IRC meeting starting in 10 minutes in channel #wikimedia-tech, hosts: @halfak & @CFisch_WMDE - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[17:52:02] 10Scoring-platform-team (Current), 10Wikilabels, 10User-Ladsgroup: Wikilabels needs manual reboot when DB connection is broken - https://phabricator.wikimedia.org/T209604 (10Halfak) What does removing the explicit transaction or using query building have to do with this task? I'm confused.
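An aside on the .count vs .rate confusion at 15:00–15:01 above: statsd-style .count series in Graphite hold raw per-flush-interval totals, while .rate is already normalized per second, and scaleToSeconds() converts the former into the latter. The sketch below compares the two through Graphite's render API; the Graphite host is hypothetical, and the metric paths simply mirror the ones quoted in the channel.

```python
# Rough sketch of comparing a raw .count series against the normalized .rate
# series via Graphite's render API. The host below is hypothetical.
import requests

GRAPHITE = "https://graphite.example.org/render"

targets = [
    # already normalized to requests per second
    "sumSeries(ores.ores2*.scores_request.rate)",
    # raw per-interval totals, normalized on the fly
    "scaleToSeconds(sumSeries(ores.ores2*.scores_request.count), 1)",
]

resp = requests.get(
    GRAPHITE,
    params={"target": targets, "from": "-1h", "format": "json"},
    timeout=10,
)
resp.raise_for_status()

for series in resp.json():
    points = [v for v, _ in series["datapoints"] if v is not None]
    avg = sum(points) / len(points) if points else 0.0
    print(f"{series['target']}: ~{avg:.1f} req/s")
```

If the two targets agree, the dashboard panel was simply plotting the non-normalized series; if they diverge, the panel's scaling is worth a second look.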
[19:07:26] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 5687 bytes in 5.039 second response time https://wikitech.wikimedia.org/wiki/ORES
[19:07:38] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 6520 bytes in 5.047 second response time https://wikitech.wikimedia.org/wiki/ORES
[19:07:48] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 5687 bytes in 5.051 second response time https://wikitech.wikimedia.org/wiki/ORES
[19:07:49] I smell an API issue.
[19:07:55] MWAPI that is
[19:08:38] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/ORES
[19:08:52] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 973 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/ORES
[19:10:18] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.464 second response time https://wikitech.wikimedia.org/wiki/ORES
[19:10:43] Yup. Looks like it.
[19:11:02] Whenever labs and prod go down at the same time, we know it was probably an issue with something they have in common.
[22:12:02] Anyone recognise a "logstash_host: localhost" change to hieradata/role/common/ores.yaml?
[22:12:10] It's sat uncommitted on deployment-puppetmaster03
[22:17:12] halfak: ^^
[22:17:50] nvm, found it
[22:21:00] Krenair: sweet
[22:23:43] (see -releng)
[22:40:25] Krenair: i saw that now; it's hard to keep track of multi-channel conversations
[23:20:44] PROBLEM - check disk on ORES-web02.Experimental is WARNING: DISK WARNING - free space: / 1081 MB (5% inode=91%);
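A closing note on the 19:07 blip and the observation at 19:11:02 above: both the labs and production ORES clusters fetch revision data from the MediaWiki action API, which is why they tend to fail together when that shared dependency hiccups. The sketch below illustrates that dependency with an illustrative retry policy; it is not the actual ORES extraction code, and the function name and retry parameters are made up for the example.

```python
# Sketch of the shared MediaWiki-API dependency: a transient API failure
# propagates to every ORES cluster that calls it, labs and production alike.
# Retry policy and function name are illustrative only.
import time
import requests

def fetch_revision(rev_id, retries=3, timeout=5):
    """Fetch revision content from the MediaWiki action API, with basic retries."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
    }
    for attempt in range(retries):
        try:
            resp = requests.get(
                "https://en.wikipedia.org/w/api.php", params=params, timeout=timeout
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # bubbles up to the caller, surfacing as a 500 from the web tier
            time.sleep(2 ** attempt)

if __name__ == "__main__":
    data = fetch_revision(123456)  # arbitrary example revision ID
    print(sorted(data.get("query", {})))
```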