[00:25:05] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.088 second response time
[00:25:14] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.120 second response time
[00:25:15] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.111 second response time
[00:27:05] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 384 bytes in 1.129 second response time
[00:27:15] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 419 bytes in 0.111 second response time
[00:27:17] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 384 bytes in 0.580 second response time
[00:39:25] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.101 second response time
[00:41:25] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 384 bytes in 1.099 second response time
[02:26:51] 06Revision-Scoring-As-A-Service, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-ORES, 10ORES, 07Wikimedia-Incident: Config beta ORES extension to use the beta ORES service - https://phabricator.wikimedia.org/T141825#2513412 (10Legoktm) Well, I merged it. But I don't think we're still at the right...
[02:43:05] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.119 second response time
[02:43:15] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.114 second response time
[02:43:15] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.089 second response time
[02:45:05] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 368 bytes in 0.624 second response time
[02:45:15] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 418 bytes in 0.110 second response time
[02:45:15] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 383 bytes in 0.600 second response time
[05:47:35] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.128 second response time
[05:47:36] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.089 second response time
[05:47:36] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.116 second response time
[05:49:35] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 404 bytes in 0.076 second response time
[05:49:35] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 404 bytes in 0.084 second response time
[05:49:35] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 1.091 second response time
[06:32:45] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.077 second response time
[06:32:45] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.082 second response time
[06:32:45] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.087 second response time
[06:34:36] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 405 bytes in 0.142 second response time
[06:34:36] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 370 bytes in 1.122 second response time
[06:34:45] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 370 bytes in 0.642 second response time
[08:32:45] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.074 second response time
[08:32:45] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.067 second response time
[08:32:46] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.083 second response time
[08:34:45] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 0.638 second response time
[08:34:45] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 0.622 second response time
[08:34:45] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 0.648 second response time
[09:22:16] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.104 second response time
[09:22:16] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.131 second response time
[09:22:16] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.123 second response time
[09:24:14] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 404 bytes in 0.086 second response time
[09:24:14] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 404 bytes in 0.077 second response time
[09:24:15] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 0.656 second response time
[09:39:57] *coughing*
[09:40:04] ores in labs is sad
[09:55:46] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.089 second response time
[09:55:46] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.062 second response time
[09:55:46] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.096 second response time
[09:57:45] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 405 bytes in 0.144 second response time
[09:57:46] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 405 bytes in 0.063 second response time
[09:57:54] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 370 bytes in 0.631 second response time
[10:11:55] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.351 second response time
[10:11:55] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.095 second response time
[10:11:55] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.095 second response time
[10:13:46] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 384 bytes in 0.598 second response time
[10:13:54] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 419 bytes in 0.141 second response time
[10:13:55] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 1.167 second response time
[10:14:43] okay, definitely we need to do something about ores in labs
[10:39:38] 06Revision-Scoring-As-A-Service, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-ORES, 10ORES, 07Wikimedia-Incident: Config beta ORES extension to use the beta ORES service - https://phabricator.wikimedia.org/T141825#2518502 (10Ladsgroup) >>! In T141825#2517811, @Legoktm wrote: > Well, I merged it....
[11:35:55] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.060 second response time
[11:35:55] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.097 second response time
[11:35:56] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.089 second response time
[11:37:46] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 370 bytes in 0.591 second response time
[11:37:51] 06Revision-Scoring-As-A-Service, 10ORES, 10revscoring, 07Documentation: Add MacOS instructions for installation to README - https://phabricator.wikimedia.org/T139355#2518669 (10schana)
[11:37:55] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 370 bytes in 1.083 second response time
[11:37:56] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 1.080 second response time
[12:16:41] 06Revision-Scoring-As-A-Service, 10MediaWiki-extensions-ORES: ORES extension jobs should fail when scoring is errored not to throw exception - https://phabricator.wikimedia.org/T141978#2518696 (10Ladsgroup)
[12:45:16] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.133 second response time
[12:45:16] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.140 second response time
[12:45:26] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.102 second response time
[12:47:15] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 405 bytes in 0.068 second response time
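The flapping above comes from an HTTP check that flags anything other than a 200 response as CRITICAL and rechecks roughly every two minutes. Purely as an illustration (not the actual Icinga check, whose exact URL and options are not shown in this log), a minimal poller against ores.wmflabs.org could look like this:

```python
import time
import requests

URL = "https://ores.wmflabs.org/"  # the exact path the real check hits is not shown in the log

while True:
    try:
        r = requests.get(URL, timeout=10)  # the checks above time out after 10 seconds
        state = "OK" if r.status_code == 200 else "CRITICAL"
        print(f"{state}: HTTP/1.1 {r.status_code} - {len(r.content)} bytes "
              f"in {r.elapsed.total_seconds():.3f} second response time")
    except requests.RequestException as exc:
        print(f"CRITICAL - {exc}")
    time.sleep(120)  # matches the ~2-minute PROBLEM/RECOVERY cadence seen above
```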
[12:47:16] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 370 bytes in 0.597 second response time
[12:47:24] OK this is crazy
[12:47:26] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 0.582 second response time
[12:51:57] hmm, the compare_extraction for badwords is failing for tamil, and I'm not quite sure why - if the labs stuff is under control, would someone mind taking a look? https://github.com/schana/revscoring/blob/tamil/revscoring/languages/tests/test_tamil.py#L62
[12:52:10] AssertionError: [] != ['ஓத்த']
[12:52:22] 06Revision-Scoring-As-A-Service, 10MediaWiki-extensions-ORES: ORES extension jobs should just fail when scoring is errored not to throw exception - https://phabricator.wikimedia.org/T141978#2518743 (10Ladsgroup) p:05Triage>03High
[12:54:29] schana, what's the rule that should match
[12:55:19] I'm not sure I understand; I thought this is saying that it's finding a bad word in the blob of text I grabbed from a featured article
[12:55:32] when that doesn't look to be the case
[12:55:54] Well it depends on what test is failing
[12:56:48] Looks like it's not catching a badword to me
[12:57:11] https://github.com/schana/revscoring/blob/tamil/revscoring/languages/tamil.py#L29
[12:57:16] Maybe it's the one you commented out?
[12:57:18] :P
[12:57:40] AssertionError: ['ஓத்த'] != ['ஓத்தா']
[12:57:45] with it uncommencted
[12:57:48] *uncommented
[13:03:47] (03PS1) 10: Jobs fail instead of throwing error when score is not right [extensions/ORES] - 10https://gerrit.wikimedia.org/r/302703 (https://phabricator.wikimedia.org/T141978) (owner: 10Ladsgroup)
[13:08:40] In what test?
[13:09:09] badwords
[13:09:17] https://github.com/schana/revscoring/blob/tamil/revscoring/languages/tests/test_tamil.py#L62
[13:10:01] Look like lines 14 and 15 are the culprit
[13:10:15] Oh wait. One sec.
[13:10:33] Sorry. https://github.com/schana/revscoring/blob/tamil/revscoring/languages/tamil.py#L29
[13:10:38] Lines 29 and 30 here.
[13:10:46] You have a regex that will match just the first part of the word.
[13:10:58] Followed by a regex that will match a longer bit of the word.
[13:11:12] If you want the first regex to end at a word boundary, end it with "\b"
[13:12:16] I'd recommend combining lines 29 and 30 into r"ஓத்தா?"
[13:14:24] schana, ^
[13:15:06] This should take advantage of a greedy pattern in the regex
[13:15:55] halfak: ack
[13:17:18] 06Revision-Scoring-As-A-Service, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-ORES, 10ORES, 07Wikimedia-Incident: [Spike] Should we make a model for ores in beta? - https://phabricator.wikimedia.org/T141980#2518777 (10Ladsgroup)
[13:21:28] (03CR) 10Ladsgroup: "I haven't tested it out in a real environment yet but it looks straightforward. I will do it soon." [extensions/ORES] - 10https://gerrit.wikimedia.org/r/302703 (https://phabricator.wikimedia.org/T141978) (owner: 10Ladsgroup)
[13:22:44] I'm going to try rebooting the redis node in order to address these intermittent issues.
[13:23:51] So this is going to set off some alarms
[13:23:58] But then again, that's become the new normal :)
[13:24:02] o/ Amir1
[13:24:03] Around?
[13:24:14] halfak: yeah, reading here
[13:24:28] I'm about to start the process of adding password to redis
[13:24:38] o/ btw
[13:25:06] Gotcha. Sounds good. This issue with labs is driving me crazy
[13:25:13] I can't seem to figure out what the root cause is.
[13:25:16] But I suspect redis.
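The test failure discussed above ([13:10]–[13:15]) comes down to two overlapping badword patterns where the shorter one wins the match. A minimal reproduction with Python's re module (the Tamil strings are the ones from the log; how revscoring assembles its badwords patterns internally is not shown here, and the sample text is invented for illustration):

```python
import re

text = "... ஓத்தா ..."  # illustrative snippet containing the longer form of the word

# Two separate patterns, as on lines 29-30 of the branch's tamil.py: the shorter
# alternative matches first, so the extractor reports 'ஓத்த' instead of 'ஓத்தா'.
print(re.findall(r"ஓத்த|ஓத்தா", text))   # ['ஓத்த']

# The combined, greedy pattern suggested in the log: the optional trailing vowel
# sign is consumed when present, so the full word is reported.
print(re.findall(r"ஓத்தா?", text))        # ['ஓத்தா']
```

Ending the shorter pattern with \b, as also suggested in the log, is the other way to keep it from matching inside the longer word.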
[13:26:42] halfak: in that case other nodes should go down too
[13:26:51] they have been
[13:27:02] I don't think this is the screwy -05 issue
[13:27:56] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.088 second response time
[13:27:56] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.170 second response time
[13:28:05] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.058 second response time
[13:28:37] speaking of the devil
[13:29:02] I just rebooted the redis node
[13:29:03] :P
[13:29:06] And said this was coming
[13:29:25] OK we're back up now
[13:30:01] Looks like we're swapping a little on the redis node and it's introducing iowait spikes
[13:30:05] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 384 bytes in 0.567 second response time
[13:30:05] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 384 bytes in 1.081 second response time
[13:30:05] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 368 bytes in 1.501 second response time
[13:30:20] https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1470230397.891&target=ores.ores-redis-01.cpu.total.iowait
[13:30:32] Waits going up to 20-40 is bad news
[13:30:36] And could explain this
[13:30:51] So I think we should lower the maxmemory for our cache down to 2.5GB
[13:33:15] Weird. I don't see a clear location that we configure this
[13:33:46] * halfak installs the puppet language package for atom
[13:35:12] Found it!
[13:38:02] akosiaris, have a minute to talk about https://gerrit.wikimedia.org/r/302705 and https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1470230397.891&target=ores.ores-redis-01.cpu.total.iowait ?
[13:39:04] Actually, this shows the issue better: https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1470231514.188&target=ores.ores-redis-01.cpu.total.iowait&from=00%3A00_20160701&until=23%3A59_20160803
[13:39:41] Somehow, our IOwait on the labs redis node jumped up on Sunday
[13:40:11] And we consistently started using swap https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1470231591.284&from=00%3A00_20160729&until=23%3A59_20160803&target=ores.ores-redis-01.cpu.total.iowait&target=ores.ores-redis-01.memory.SwapTotal&target=ores.ores-redis-01.memory.SwapFree
[13:40:14] halfak: hmmm... maybe some rescheduling ...
[13:40:25] er, I am on which physical host hosts the VM
[13:40:54] * akosiaris looking
[13:41:10] Cool. I guess it could be that too.
[13:41:18] Man, this memory usage pattern is insane! https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1470231641.282&from=00%3A00_20160729&until=23%3A59_20160803&target=ores.ores-redis-01.memory.MemFree
[13:41:41] It could be that redis' LRU strategy went crazy
[13:42:03] Either way, i don't see lowering the maxmem hurting us.
[13:42:38] o/ jem
[13:42:40] Around?
[13:42:48] halfak: btw. I made a patch for this: https://phabricator.wikimedia.org/T141978
[13:44:20] hmm, the iowait is for the / disk and not the redis one
[13:44:37] as in vda, not dm-0
[13:44:45] akosiaris? I figured it was swap or the rdb file.
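For the memory discussion above, a hypothetical diagnostic sketch using redis-py. The host name is taken from the log (ores-redis-01) and the port from the tcp_6380 instance mentioned later; both are assumptions about the labs setup, and the maxmemory change that actually shipped went through puppet (https://gerrit.wikimedia.org/r/302705) rather than a runtime CONFIG SET like the last line here.

```python
import redis

r = redis.StrictRedis(host="ores-redis-01", port=6380)  # host/port assumed from the log

mem = r.info("memory")
print("used memory:        ", mem["used_memory_human"])
print("fragmentation ratio:", mem["mem_fragmentation_ratio"])
print("maxmemory:          ", r.config_get("maxmemory"))
print("eviction policy:    ", r.config_get("maxmemory-policy"))

# Lowering the cache ceiling to 2.5GB, as proposed at [13:30:51].
r.config_set("maxmemory", str(int(2.5 * 1024 ** 3)))
```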
[13:46:33] * halfak doesn't know what "vda" and "dm-0" are referring to
[13:46:57] oh, /dev/vda => / and /dev/dm-0 => /srv
[13:47:14] funny thing is that it's indeed redis... but redis is under /srv
[13:47:28] Oh! / and in "/" :)
[13:47:31] What about swap?
[13:47:48] *as in
[13:47:50] it is indeed swapping occasionally
[13:48:00] Do you think that explains the IO wait spikes?
[13:48:57] it's not swapping that much...
[13:49:10] in other cases perhaps... but on this one ? ... doesn't look that much like it
[13:49:19] it's more like the swap is a product of the situation, not the cause
[13:49:42] I 'll disable temporarily the swap just to make sure
[13:50:36] and there still is iowait, albeit somewhat lower
[13:51:16] ah
[13:51:28] so /dev/vda is ofc the entire disk.. not /
[13:51:42] it's /dev/vda3 that's / ..
[13:51:47] ok so it's redis for sure
[13:52:49] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 1.747 second response time
[13:52:49] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 1.754 second response time
[13:52:50] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 1.774 second response time
[13:52:55] Amir1, just checked out your patch. Awesome. :)
[13:52:59] akosiaris, ^ was that you?
[13:53:06] no, I 've done nothing yet
[13:53:15] OK. We're having an iowait spike on redis-01
[13:53:26] top shows 30-50
[13:53:38] kswapd0 is going like mad
[13:53:45] OK calmed down
[13:54:00] can't be, I have disabled the swap
[13:54:06] probably some old data ?
[13:54:09] * halfak just reports what he sees
[13:54:18] so up to 40MB/s for redis
[13:54:37] why is it reading so much ?
[13:54:49] reading should be most of what redis does
[13:54:55] But that does seem like a lot
[13:54:59] for a while I thought it was the aof file, but it is not
[13:55:14] :)
[13:55:23] I need to go, I'll be back in a few hours
[13:55:24] Oh yeah good point. Should be reading from mem
[13:55:24] o/
[13:55:27] o/ Amir1
[14:00:20] ok, so I got something... redis-instance-tcp_6380.service: main process exited, code=killed, status=9/KILL
[14:00:32] ah here we are
[14:00:35] OOM showed up
[14:00:40] I should have figure that earlier
[14:01:07] uptime
[14:01:08] 14:01:00 up 37 min, 2 users, load average: 1.79, 2.06, 1.43
[14:01:17] halfak: did you restart the VM ? or someone else ?
[14:01:18] So do you think we should drop the maxmem like I proposed?
[14:01:22] I did akosiaris
[14:01:28] Sorry if that was trouble
[14:01:39] no I am just trying to compose a line of events
[14:03:55] so OOM first showed up at 13:46:58 2016 UTC
[14:04:23] that was while swap was still enabled
[14:04:59] OOM killed redis then
[14:05:18] redis got a SIGKILL(9) died and systemd tried to restart
[14:05:55] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 0.643 second response time
[14:06:06] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 370 bytes in 0.618 second response time
[14:06:14] and ofc OOM showed up again a few mins later. The fact I had disabled the swap did not help as there wasn't enough memory anymore as it seems for redis
[14:06:24] so it was killed again and again by OOM
[14:06:26] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 1.141 second response time
[14:07:24] reenabling the swap gave the system enough room to start redis, but it is going to happen again
[14:08:38] halfak: I suggest a larger VM as far as memory goes
[14:09:01] it's already in 90% swap
[14:09:50] er 80% swap
[14:10:48] OK we can do that.
[14:10:58] I'll try to spool up ores-redis-02 then?
[14:11:41] sure. so read/write now is normal, however the iowait is still high
[14:13:12] the SETEXs and GETs don't look really heavy so I am gonna look at the hosting physical host now
[14:13:17] * akosiaris needs to figure out which one it is
[14:15:20] \o/ no quota error
[14:15:26] <3 labs
[14:15:32] It's so nice to be able to spin up a VM
[14:22:05] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.149 second response time
[14:22:25] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.071 second response time
[14:22:36] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 2675 bytes in 0.078 second response time
[14:24:06] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 370 bytes in 1.113 second response time
[14:24:26] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 384 bytes in 0.588 second response time
[14:24:37] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 370 bytes in 1.135 second response time
[14:25:20] Almost done setting up ores-redis-02
[14:25:30] that ^ was one more OOM killing redis
[14:26:44] akosiaris, I'd like to copy over the cache from ores-redis-01 before we get started. Do you have any experience with that?
[14:27:23] the physical host btw does have the same iowait as the redis server. so that maybe explains it. there was an upgrade yesterday which might have shifted things around
[14:27:29] halfak: yeah, it's easy enough
[14:27:35] but we need to shutdown redis to that
[14:27:48] so, downtime
[14:27:55] OK it looks like ores-redis-02 is good to go. We'll need to do some transfers.
[14:28:08] As of right now, we're having a few seconds of downtime every few minutes
[14:28:14] So, not that big of a deal, right?
[14:28:18] ok
[14:28:27] then give 5 mins and I 'll get it working
[14:29:04] OK. We'll need to restart all of the web/celery nodes too so that they point to the new redis node.
[14:29:21] that was in hiera somewhere, wasn't it
[14:29:26] Yup https://wikitech.wikimedia.org/wiki/Hiera:Ores
[14:33:13] I imagine I should wait to make that edit since we'll want to run puppet right after
[14:38:37] Wikitech hiera edit is ready when we are
[14:38:55] grrr... blocked on LOADING Redis is loading the dataset in memory
[14:39:04] OOM rekilled it
[14:39:08] Arg
[14:41:16] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:42:08] I'd like to step away for 10 mins to have a shower at some point. I sort of jumped right into troubleshooting this morning.
[14:42:16] akosiaris, is now a good time or should I hang out for a bit?
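The conversation above ([14:26]–[14:28]) settles on shutting Redis down and copying its data over to ores-redis-02, accepting a short downtime. As an alternative sketch only (not what was actually run), the cache could also be copied online with DUMP/RESTORE via redis-py; the hostnames and the 6380 port are assumptions based on the log.

```python
# Copy keys from the old labs Redis to the new one without stopping either.
import redis

src = redis.StrictRedis(host="ores-redis-01", port=6380)  # names/port assumed from the log
dst = redis.StrictRedis(host="ores-redis-02", port=6380)

for key in src.scan_iter(count=1000):
    payload = src.dump(key)          # value serialized in Redis' internal format
    if payload is None:
        continue                     # key expired between SCAN and DUMP
    ttl_ms = src.pttl(key)           # remaining TTL in ms; -1 means no expiry
    dst.restore(key, ttl_ms if ttl_ms > 0 else 0, payload, replace=True)
```

The shutdown-and-copy approach in the log is simpler and guarantees a consistent snapshot; the online copy only makes sense when the brief downtime is not acceptable.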
[14:42:39] halfak: I have it, you can go
[14:42:50] OK back soon
[14:42:55] PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:43:07] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:43:07] expected ^
[14:58:59] halfak: I 'm done. I 've done the edit and force a puppet run everywhere
[15:01:11] Back!
[15:01:28] akosiaris, "ores::web::redis_host": ores-redis-01
[15:01:32] making the edit now
[15:01:51] https://wikitech.wikimedia.org/w/index.php?title=Hiera%3AOres&type=revision&diff=816118&oldid=816117
[15:06:40] looks like this is not working.
[15:07:07] halfak: Error 400 on SERVER: DNS lookup failed for 10.68.17.240 Resolv::DNS::Resource::IN::A at /etc/puppet/modules/scap/manifests/target.pp:98
[15:07:11] sigh.. what is this ...
[15:07:15] yeah..
[15:08:03] From hiera: "scap::deployment_server": 10.68.17.240
[15:08:07] But I think that's a relic
[15:08:13] We also have "service::uwsgi::deployment": fabric
[15:09:20] I 've removed the scap::deployment_server ...
[15:09:23] and of course it did not work
[15:09:44] reverting.. it worked anyway for some reason on -05
[15:10:42] Weird. The workers are good too.
[15:12:17] I'm trying to figure out how we disabled scap3 on labs
[15:12:31] I figured it had to do with the "service::uwsgi::deployment": fabric
[15:12:48] halfak: er, this is fabric ? isn't this the one that does not use the puppet config or something ?
[15:13:15] not that I see ores-redis-02 anywhere... grrr
[15:13:42] This uses puppet, but yeah, we don't load any custom configs for the labs instance from puppet.
[15:13:55] No custom config is necessary in labs
[15:17:15] so, a deploy is needed ?
[15:18:00] ?
[15:18:16] Not sure what you mean. FWIW, scap3 shouldn't touch our labs instances at all
[15:18:50] oh, sorry, I was just asking if ores reads the /etc/ores/ configs on las
[15:18:51] labs*
[15:18:59] Nope
[15:19:30] Hmm... i see it is there anyway
[15:19:34] OH!
[15:19:47] I see that the host is listed in the config.
[15:19:53] It is reading that and it shouldn't need to
[15:19:57] I'm working on a deploy
[15:20:04] I 've just commited https://gerrit.wikimedia.org/r/#/c/302718/
[15:20:06] (still not sure why puppet is failing but I can work on this)
[15:20:12] let's see if this fixes anything
[15:20:29] Sure. That works
[15:21:22] I'll fix the base config in ores-wmflabs-deploy too
[15:21:30] We really shouldn't have a custom config at all on labs.
[15:22:35] so /etc/ores/99-redis.yaml on worker-06 has ores-redis-02 now.. not sure if it is honored ofc
[15:22:51] It is. But it shouldn't exist at all :/
[15:23:03] Oh well.
[15:23:12] well, I don't see any traffic in either of the 2 redises yet
[15:23:15] cache/queue
[15:23:23] and I 've restarted the worker on worker-06
[15:31:42] I'm talking to Amir1 on another chat. It seems like the hack to make scap3 not complain is deep.
[15:32:09] It seems like we could modify puppet to just read "service::uwsgi::deployment" and deploy scap3 via puppet
[15:32:16] What do you think about doing that akosiaris?
[15:35:08] what do you mean ?
[15:35:40] "It seems like we could modify puppet to just read "service::uwsgi::deployment" and deploy scap3 via puppet"
[15:35:46] I'm not sure how else to say it.
[15:35:59] deploy scap3 ?
[15:36:05] that's what I am not following
[15:36:12] Oh! Sorry, my derp. I meant disable.
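The /etc/ores overlay files mentioned above (99-main.yaml, 99-redis.yaml) layer on top of a base configuration, with later files overriding earlier ones; that is how a 99-redis.yaml ends up deciding which Redis host the workers use. The following is a rough illustration of that kind of numbered-overlay merge, not the ORES service's actual config loader:

```python
import glob
import yaml  # PyYAML


def deep_merge(base, override):
    """Recursively merge `override` into `base`; later values win."""
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_merge(base[key], value)
        else:
            base[key] = value
    return base


config = {}
for path in sorted(glob.glob("/etc/ores/*.yaml")):  # lexicographic order: 00-... before 99-...
    with open(path) as f:
        deep_merge(config, yaml.safe_load(f) or {})

# With a 99-redis.yaml present, the merged config points the cache and the task
# broker at whatever Redis host that last overlay names.
```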
[15:36:20] My eyes just skipped right over "deploy"
[15:36:53] * halfak wants scap3 deploys to be a separate module, but that refactoring can wait for non-downtime.
[15:38:05] still chasing what the bug is tbh, not sure if it would or not
[15:38:56] Looking for the scap3 400 error bug? I think the bug is that we are still trying to use scap3.
[15:39:15] heh
[15:39:48] But seriously -- just to make this 100% clear: We are not using scap3 at all on wmflabs.
[15:40:14] OK I've modified the /etc/ores/99-main.yaml files to point to the new redis and it seems we are back online
[15:40:26] Back. But we still can't score anything
[15:40:29] *bah
[15:40:39] So it looks like I need to do this with the workers
[15:40:56] I did that
[15:42:23] hmm actually I think I 've found out a simple way
[15:42:28] Og! Needs a restart!
[15:42:28] lemme make a patch
[15:43:41] We need to stop writing 99-main.yaml and 99-redis.yaml to /etc/ores too
[15:49:57] * halfak pushes updated configuration to staging
[15:56:49] so, puppet had not run on those boxes in a very long time
[15:57:25] anyway I 've merged https://gerrit.wikimedia.org/r/#/c/302727/ which I am not in love with too much but it fixed the problem
[15:57:42] I finally see requests hitting the redis db
[16:00:37] RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 370 bytes in 0.864 second response time
[16:02:27] akosiaris, where do I find the module code for service::uwsgi?
[16:02:39] I'm looking for what that argument change does
[16:03:05] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 384 bytes in 1.182 second response time
[16:03:15] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 0.640 second response time
[16:03:33] halfak: ops/puppet
[16:03:43] :P
[16:03:46] :P
[16:03:50] But where in there? :D
[16:03:51] it effectively disables scap3
[16:04:51] Ahh! I just found it in models/service/manifests/uwsgi.pp
[16:04:52] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/service/manifests/uwsgi.pp
[16:05:14] I was looking at modules/uwsgi/...
[16:05:27] * halfak often gets lost in ops/puppet
[16:05:31] yeah puppet naming is not the best
[16:05:39] and is not our fault most of the times
[16:05:57] for example include ::service means use modules/services/manifests/init.pp
[16:06:12] but include ::service::uwsgi means use modules/service/manifests/uwsgi.pp
[16:06:22] what for ...
[16:06:42] Gotcha
[16:07:00] "Note: this parameter will be removed onces ores.wmflabs.org stops using service::uwsgi" -- Maybe not all uwsgi should be deployed with scap3? Do you think we should have a whole separate uwsgi for labs systems that use alternative deployment methods of uwsgi?
[16:08:06] service::uwsgi is actually a wrapper around uwsgi adding the various niceties we 've built over the years, like alerting, monitoring etc
[16:08:19] using uwsgi directly is obviously a first class choice for labs
[16:08:35] Oh! I gotcha. That makes sense then.
[16:09:02] ofc beta should be using service::uwsgi
[16:09:08] What do you think about doing some cleanup here with me sometime soon.
[16:09:10] ?
[16:09:15] I would love to
[16:10:47] we 've lost more time to ship the damned config change than actually moving the data around
[16:10:48] Cool. I'll start an etherpad of things that I think we need to do.
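The naming convention akosiaris explains above is Puppet's standard autoload rule: the first segment of a class name picks the module directory, a bare module name resolves to init.pp, and any further segments become the manifest path (so ::service lives under modules/service/ and ::service::uwsgi under modules/service/manifests/uwsgi.pp). A small sketch of that mapping:

```python
def puppet_manifest_path(class_name: str) -> str:
    """Map a Puppet class name to the manifest file the autoloader reads."""
    parts = class_name.lstrip(":").split("::")
    module, rest = parts[0], parts[1:]
    filename = "/".join(rest) + ".pp" if rest else "init.pp"
    return f"modules/{module}/manifests/{filename}"


print(puppet_manifest_path("::service"))         # modules/service/manifests/init.pp
print(puppet_manifest_path("::service::uwsgi"))  # modules/service/manifests/uwsgi.pp
```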
[16:11:21] https://etherpad.wikimedia.org/p/ores_puppet_cleanup
[16:27:29] akosiaris, I think https://etherpad.wikimedia.org/p/ores_puppet_cleanup is ready for your review
[16:28:41] 06Revision-Scoring-As-A-Service, 10ORES, 06Operations, 07Puppet: Clean up puppet & configs for ORES - https://phabricator.wikimedia.org/T142002#2519474 (10Halfak)
[16:28:47] 06Revision-Scoring-As-A-Service, 10ORES, 06Operations, 07Puppet: Clean up puppet & configs for ORES - https://phabricator.wikimedia.org/T142002#2519487 (10Halfak) p:05Triage>03Normal
[16:31:10] 06Revision-Scoring-As-A-Service, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-ORES, 10ORES, 07Wikimedia-Incident: [Spike] Should we make a model for ores in beta? - https://phabricator.wikimedia.org/T141980#2519511 (10Halfak) Right now, the testwiki model is a model for all intents and purposes....
[16:33:45] Alright I'm AFK for lunch. I imaging it's late for you akosiaris, so if I don't catch you later, have a good night!
[19:30:21] 10Revision-Scoring-As-A-Service-Backlog, 10revscoring: Implement abstraction for Sparse Feature Vectors - https://phabricator.wikimedia.org/T132580#2520316 (10Halfak) OK. More progress. I imagined what it might look like to use the abstractions that I'm imagining. See https://gist.github.com/halfak/4b5e8569...
[19:31:05] AFK while I run an errand. Back in 1 hour
[19:31:19] Turns out I was a day off on my Global Entry appointment
[19:31:23] Now, I'll do it for real.
[20:58:12] back!
[20:58:24] Took longer than expected, but everything went OK
[21:07:48] 06Revision-Scoring-As-A-Service, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-ORES, 10ORES, and 2 others: [Spike] Should we make a model for ores in beta? - https://phabricator.wikimedia.org/T141980#2520650 (10Danny_B)
[21:10:28] 06Revision-Scoring-As-A-Service, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-ORES, 10ORES, 07Spike: [Spike] Should we make a model for ores in beta? - https://phabricator.wikimedia.org/T141980#2520653 (10greg)
[22:06:46] 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, 10ORES: Extrapolate memory usage per worker forward 2 years - https://phabricator.wikimedia.org/T142046#2520810 (10Halfak)
[22:08:11] 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, 10ORES: Extrapolate memory usage per worker forward 2 years - https://phabricator.wikimedia.org/T142046#2520824 (10Halfak) | **Date** | **Models added** | 2015-09-21 | enwiki-reverted, enwiki-damaging, enwiki-goodfaith, ptwiki-re...
[22:33:21] 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, 10ORES: Extrapolate memory usage per worker forward 2 years - https://phabricator.wikimedia.org/T142046#2520927 (10Halfak) | Date | RES memory (KB) | 2015-09-21 | 364028 | 2015-10-07 | 415364 Well... that's as far as I was able to ge...
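The T142046 comment above quotes two per-worker RES measurements (364028 KB on 2015-09-21 and 415364 KB on 2015-10-07). A back-of-the-envelope linear extrapolation of just those two points, purely to illustrate the arithmetic the task asks for (two samples are far too few to trust, and memory growth tracks model additions rather than calendar days):

```python
from datetime import date

points = {date(2015, 9, 21): 364028, date(2015, 10, 7): 415364}  # RES in KB, per worker

(d0, m0), (d1, m1) = sorted(points.items())
kb_per_day = (m1 - m0) / (d1 - d0).days        # ~3209 KB/day between the two samples

target = date(2017, 10, 7)                      # two years past the later sample
projected_kb = m1 + kb_per_day * (target - d1).days
print(f"{kb_per_day:.0f} KB/day -> ~{projected_kb / 1024 ** 2:.1f} GB RES per worker by {target}")
```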