[07:57:59] 10Scoring-platform-team (Current), 10ORES, 10Patch-For-Review: Deploy ORES (revscoring 2.0) - https://phabricator.wikimedia.org/T175180#3585227 (10Rical) In T141177, I expressed that Wikipedia's main content loses sources when edits are reverted. ORES seems like a good way to address that. [09:05:22] 10Scoring-platform-team (Current), 10ORES, 10Patch-For-Review: Deploy ORES (revscoring 2.0) - https://phabricator.wikimedia.org/T175180#3791674 (10Rical) About sources in edits, ORES could better estimate sources, edits, the intentions of the edit's author, and the intentions of the patroller. [10:11:32] 10Scoring-platform-team, 10MediaWiki-Recent-changes, 10MediaWiki-Watchlist, 10ORES, and 4 others: Watchlist and RecentChanges failure due to ORES on frwiki and ruwiki - https://phabricator.wikimedia.org/T181006#3791756 (10Trizek-WMF) I've left a message to [[ https://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%B... [12:32:42] (03CR) 10Thiemo Mättig (WMDE): [C: 04-1] "The -1 is for the string[] mistake, as well as for some documentation I believe is incomplete. Otherwise the change looks very fine, and I" (037 comments) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393620 (https://phabricator.wikimedia.org/T181334) (owner: 10Ladsgroup) [15:06:34] 10Scoring-platform-team (Current): Build mid-level WikiProject category training set - https://phabricator.wikimedia.org/T172321#3792660 (10Sumit) [15:06:36] 10Scoring-platform-team (Current): Generate mid-level WikiProject categories - https://phabricator.wikimedia.org/T179311#3792657 (10Sumit) 05Open>03Resolved a:03Sumit >>! In T179311#3723831, @Halfak wrote: > Looks like we don't include the top level category names yet. @Sumit said he'd like to do that in... [15:36:02] Awight I'll [15:36:06] Work on it in a min [15:36:14] Zppix: No rush, thanks again! [15:36:27] Np [15:50:12] 10Scoring-platform-team (Current), 10Global-Collaboration, 10MediaWiki-extensions-ORES, 10MW-1.31-release-notes (WMF-deploy-2017-11-14 (1.31.0-wmf.8)), and 3 others: Make ORES-consuming pages more robust to ORES errors - https://phabricator.wikimedia.org/T181191#3792785 (10awight) Confirmed that we also ha... [15:52:23] 10Scoring-platform-team (Current), 10Global-Collaboration, 10MediaWiki-extensions-ORES, 10MW-1.31-release-notes (WMF-deploy-2017-11-14 (1.31.0-wmf.8)), and 3 others: Make ORES-consuming pages more robust to ORES errors - https://phabricator.wikimedia.org/T181191#3792796 (10awight) [15:52:45] 10Scoring-platform-team, 10ORES, 10Patch-For-Review: Upgrade celery to 4.1.0 for ORES - https://phabricator.wikimedia.org/T178441#3792800 (10awight) [15:53:00] 10Scoring-platform-team, 10ORES, 10Patch-For-Review: Upgrade celery to 4.1.0 for ORES - https://phabricator.wikimedia.org/T178441#3792802 (10awight) Dropping from the current workboard. [15:53:22] akosiaris: I have a fun change to make, if you want to think about deployment strategy: T181071 [15:53:22] T181071: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071 [15:53:51] akosiaris: I prepared some patches, but they can’t be applied without doing something transitional. [15:55:13] I was thinking, we could do something temporary like symlink along the lines of “ln -s deploy/venv deploy/../venv”, but that doesn’t quite give us a safe rollback. 
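For context on the transitional step awight sketches above (keeping the built virtualenv inside the versioned deploy directory and temporarily pointing the old shared path at it), here is a minimal Python sketch of that symlink idea. The paths are hypothetical, mirroring the "ln -s deploy/venv deploy/../venv" line from the conversation, not the actual deployment layout.
```
import os

# Hypothetical paths mirroring the "ln -s deploy/venv deploy/../venv" idea above;
# not the actual deployment layout.
DEPLOY_DIR = "/srv/deployment/ores/deploy"
VERSIONED_VENV = os.path.join(DEPLOY_DIR, "venv")  # venv cached with the checkout
LEGACY_VENV = os.path.normpath(os.path.join(DEPLOY_DIR, os.pardir, "venv"))  # old shared location

def link_legacy_venv():
    """Point the legacy venv path at the venv cached inside the versioned source."""
    # Assumes the old shared venv directory has already been moved aside.
    if os.path.islink(LEGACY_VENV):
        os.unlink(LEGACY_VENV)
    os.symlink(VERSIONED_VENV, LEGACY_VENV)
    # Caveat from the discussion: rolling the source checkout back does not roll
    # this link back, so by itself this is not a safe rollback mechanism.
```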
[15:56:55] 10Scoring-platform-team (Current), 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#3792806 (10awight) [15:58:35] awight: done [16:01:31] Zppix: commented [16:01:36] wiki-ai/ores#848 (Pix1234-patch-1 - 539ad93 : Devin/Zppix): The build was broken. https://travis-ci.org/wiki-ai/ores/builds/308503552 [16:02:32] I did add it to requirements awight [16:02:59] And project health is okay Ive used to use it when i first dabbled in pythob [16:03:04] Python* [16:06:29] * awight squints [16:07:33] ./setup.py:3:1: F401 'require_python_3' imported but unused [16:07:57] Weird [16:09:54] Zppix: lol did you see the “code”? [16:10:03] if sys.version_info[0] != 3: [16:10:03] print("This script requires Python 3") [16:10:04] exit() [16:10:15] Yes [16:12:30] Does this work if you pip2 install ores? [16:12:42] I haven’t figured out whether pip actually reads setup.py [16:13:09] Wheels don't run setup.py [16:13:19] No arbitrary code in wheels [16:13:48] darn, so there might not be a way to prevent py2 [16:18:08] awight: I am missing context about the "fun" change. what is it about ? [16:21:16] akosiaris: The motivation is that we need faster rollback. Currently, rollback includes a step where we rebuild the virtualenv from wheels, which takes a few minutes. I realized we could reasonably include the built virtualenv alongside source code so we can toggle deployed source much faster. [16:22:20] 10Scoring-platform-team (Current): Fix response processing logic in drafttopic.fetch_page_wikiprojects - https://phabricator.wikimedia.org/T181522#3792934 (10Sumit) [16:22:34] halfak: o/ [16:23:41] awight: hmm isn't the "to rollback to" venv already populated though on the hosts ? [16:23:49] 10Scoring-platform-team (Current), 10drafttopic-modeling: Fix response processing logic in drafttopic.fetch_page_wikiprojects - https://phabricator.wikimedia.org/T181522#3792968 (10Sumit) [16:23:57] it shouldn't, shouldn't it ? [16:24:04] so why is it taking so long to rebuild it ? [16:24:14] halfak: Low-priority question… I see that Redis key count has been steady over months. That seems to mean we’re dropping stuff from the bottom of the stack, so I’m curious whether it’s LRU or just the first-added keys? [16:24:33] lru [16:24:53] akosiaris: It’s not already populated, the old venv gets destroyed and we rewrite with each source revision. [16:25:04] 10Scoring-platform-team (Current), 10Research Ideas, 10drafttopic-modeling, 10artificial-intelligence: Revscoring tune does not recognize a set of labels as target - https://phabricator.wikimedia.org/T181163#3792973 (10Sumit) [16:25:09] sigh [16:25:14] akosiaris: cos it’s currently in /srv/deployment/ores/venv, rather then under ores/deploy [16:25:25] 10Scoring-platform-team (Current), 10Research Ideas, 10drafttopic-modeling, 10artificial-intelligence: Revscoring: Statistic for multilabel classification - https://phabricator.wikimedia.org/T181166#3792974 (10Sumit) [16:25:40] halfak: How is that set up? It’s a transparent thing inside of Redis? [16:25:53] ah so we have one venv total and not a venv per deploy [16:25:58] 10Scoring-platform-team (Current), 10drafttopic-modeling: Generate mid-level WikiProject categories - https://phabricator.wikimedia.org/T179311#3792975 (10Sumit) [16:26:07] akosiaris: Exactly, that’s what I’d like to fix. 
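For reference, the two Python-3-only guards under discussion look roughly like this. The runtime `sys.version_info` check is a tidied, inlined version of the snippet quoted above (avoiding the unused `require_python_3` import flake8 complained about); `python_requires` is the declarative guard mentioned later in the log, honored by pip >= 9.0.0 / setuptools >= 24.2.0 even for wheel installs, where setup.py never runs. This is an illustrative sketch, not the actual ores `setup.py`.
```
import sys

from setuptools import setup

# Runtime guard: only helps when setup.py actually executes, which is not the
# case when installing from a wheel.
if sys.version_info[0] != 3:
    print("This script requires Python 3")
    sys.exit(1)

setup(
    name="example-package",  # placeholder metadata, not the real ores setup()
    version="0.0.1",
    # Declarative guard: recorded in the package metadata, so pip >= 9.0.0 /
    # setuptools >= 24.2.0 refuse to install under Python 2 even from a wheel.
    python_requires=">=3",
)
```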
[16:26:13] 10Scoring-platform-team, 10Research Ideas, 10drafttopic-modeling: Publish Machine-Readable WikiProjects Dataset - https://phabricator.wikimedia.org/T175037#3792976 (10Sumit) [16:26:29] ok yeah makes sense. Let me review the approach then. Judging by a first look, it does look sane [16:26:37] 10Scoring-platform-team, 10Research Ideas, 10drafttopic-modeling: Create machine-readable version of the WikiProject Directory - https://phabricator.wikimedia.org/T172326#3792978 (10Sumit) [16:26:50] 10Scoring-platform-team, 10drafttopic-modeling: Efficient method for mapping a WikiProject template to the WikiProject Directory - https://phabricator.wikimedia.org/T172325#3792979 (10Sumit) [16:27:07] 10Scoring-platform-team (Current), 10drafttopic-modeling: Build mid-level WikiProject category training set - https://phabricator.wikimedia.org/T172321#3792980 (10Sumit) [16:27:16] awight: yeah. It's a redis setting. https://redis.io/topics/lru-cache [16:27:33] awight: :/ [16:34:59] halfak: Interesting, there’s a new “LFU” option we might look at if it ever mattered. [16:36:30] Least Frequently Used [16:38:00] Hmm... Not sure there's a problem to fix there. [16:38:19] I like LRU because if someone is running an analysis, they'll request, say, 10k revisions once. [16:38:32] Then come back a few days later and request the score for the same 10k revisions again [16:38:40] I wonder if there’s much heat on specific keys. There probably is among recent changes, based on other filtering criteria that select for spammy-seeming edits, but I bet that doesn’t matter cos the keys fall out of “recent” millions of edits later. [16:38:52] in an LFU, those keys may have been evicted, but in an LRU, they probably won't be. [16:40:04] Halfak and awight i dont think we can hard require py2 so maybe we ought to maybe more ways to make it clearer? [16:40:17] ? [16:40:24] make what cleaner? [16:40:30] Also why would we *require* py2? [16:40:41] Zppix: Rats! Thanks for taking a look. [16:41:07] halfak: py3 xD [16:43:29] Ahh... [16:44:29] Zppix: looks like your python_requires thing will eventually work, https://packaging.python.org/tutorials/distributing-packages/#python-requires [16:44:54] The user would need to have setuptools >= 24.2.0 or pip >= 9.0.0 [16:45:01] YAY [16:45:05] im not useless xD [16:45:11] But that is slowly becoming the norm, even on distros that will probably default to python 2 [16:45:22] Sorry you took on such a frustrating task ;-) [16:45:51] :P [16:46:22] I can keep the pr and just wait until we update to that then? [16:55:04] 10Scoring-platform-team (Current), 10drafttopic-modeling: Generate mid-level WikiProject categories - https://phabricator.wikimedia.org/T179311#3793141 (10Halfak) @Sumit, please move to the "done" column before closing tasks. We need this in order to consistently report what has been "done". [16:55:25] Awight is what i said above ok? [16:56:05] Zppix: Do you feel like building the wheel (pip wheel ores) and seeing how require-python-3 behaves, or generally seeing what the error message is? [16:56:34] I’m confused about why installing a wheel would throw an error if we’re not in fact executing arbitrary code. [16:56:39] * paladox wants a bank holiday for the royal wedding lol :) [16:56:42] I just realised today is GCI start [16:56:48] awight: I wil [16:56:53] paladox: pageantry :p [16:56:56] 10Scoring-platform-team (Current), 10drafttopic-modeling: Generate mid-level WikiProject categories - https://phabricator.wikimedia.org/T179311#3793153 (10Sumit) >>! 
In T179311#3793141, @Halfak wrote: > @Sumit, please move to the "done" column before closing tasks. We need this in order to consistently report... [16:56:57] I got to make sure im not missing crap [16:56:58] heh [16:58:18] halfak: Amir1: I was just introduced to the new Fundraising dashboard in Grafana, it has some fun graph types and concepts, https://grafana.wikimedia.org/dashboard/db/fundraising-overview?refresh=1m&orgId=1&from=now-1h&to=now [17:02:33] 10Scoring-platform-team (Current), 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: Write reports about why Ext:ORES is helping cause server 500s and alternatives to fix - https://phabricator.wikimedia.org/T181010#3776331 (10Halfak) https://wikitech.wikimedia.org/wiki/Incident_document... [17:04:02] 10Scoring-platform-team (Current), 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: Write reports about why Ext:ORES is helping cause server 500s and alternatives to fix - https://phabricator.wikimedia.org/T181010#3793193 (10awight) [17:04:04] 10Scoring-platform-team (Current), 10ORES, 10Patch-For-Review, 10Wikimedia-Incident: Parallelize scap for ORES - https://phabricator.wikimedia.org/T181067#3793191 (10awight) 05Open>03Resolved Confirmed working! [17:06:44] 10Scoring-platform-team (Current), 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: Write reports about why Ext:ORES is helping cause server 500s and alternatives to fix - https://phabricator.wikimedia.org/T181010#3793195 (10greg) @halfak: I might be wrong, but it looks like not all of... [17:06:58] brb [17:09:31] 10Scoring-platform-team (Current), 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: Write reports about why Ext:ORES is helping cause server 500s and alternatives to fix - https://phabricator.wikimedia.org/T181010#3793201 (10awight) @greg Good point—I'd like to keep this parent task ac... [17:10:12] 10Scoring-platform-team, 10ORES, 10Patch-For-Review: Consistent TimeoutErrors when using Celery 4 - https://phabricator.wikimedia.org/T179524#3793203 (10awight) [17:11:28] 10Scoring-platform-team (Current), 10ORES: Test to see what Extension:ORES will do when it gets this null threshold result. - https://phabricator.wikimedia.org/T179845#3737973 (10awight) [17:11:32] 10Scoring-platform-team (Current), 10Global-Collaboration, 10MediaWiki-extensions-ORES, 10MW-1.31-release-notes (WMF-deploy-2017-11-14 (1.31.0-wmf.8)), and 3 others: Make ORES-consuming pages more robust to ORES errors - https://phabricator.wikimedia.org/T181191#3793237 (10awight) [17:12:03] 10Scoring-platform-team (Current), 10ORES, 10Documentation, 10User-Zppix: Announce the ORES FAQ - https://phabricator.wikimedia.org/T181103#3779666 (10awight) Noting that we've discussed announcing again, on wikitech-l. [17:13:49] o/ legoktm [17:13:56] hi halfak [17:13:57] Just heard you're officially spending time on the ORES ext [17:14:04] Amir1, has been looking into refactoring it. [17:14:11] Are you in contact with him about that? 
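Picking the Redis eviction thread back up from earlier in the log ([16:24]-[16:38]): the steady key count comes from `maxmemory` plus an LRU `maxmemory-policy`, and the "LFU" option halfak and awight mention is the least-frequently-used family of policies added in Redis 4.0 (see https://redis.io/topics/lru-cache). A hedged redis-py sketch of inspecting and changing that setting; host and port are placeholders, not the production cache host.
```
import redis

# Placeholder connection details; the production score-cache host is not shown here.
r = redis.StrictRedis(host="localhost", port=6379)

# Inspect the current eviction configuration.
print(r.config_get("maxmemory"))
print(r.config_get("maxmemory-policy"))  # e.g. {'maxmemory-policy': 'allkeys-lru'}

# LRU: evict the least recently *used* keys once maxmemory is reached.
# r.config_set("maxmemory-policy", "allkeys-lru")

# LFU (Redis >= 4.0): evict the least *frequently* used keys instead, which
# could drop the "request 10k scores once, come back days later" pattern
# halfak wants to keep warm under LRU.
# r.config_set("maxmemory-policy", "allkeys-lfu")
```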
[17:14:15] * halfak sits in tech mgmt [17:14:23] 10Scoring-platform-team (Current), 10drafttopic-modeling: Fix response processing logic in drafttopic.fetch_page_wikiprojects - https://phabricator.wikimedia.org/T181522#3793247 (10Sumit) https://github.com/wiki-ai/drafttopic/pull/13 [17:14:40] ^ <3 [17:15:01] not yet, but I've seen his tickets [17:15:50] 10Scoring-platform-team (Current), 10ORES: id string in ORES does not include "features" - https://phabricator.wikimedia.org/T179509#3793250 (10awight) Cherry-picking while Celery 4 work is on hold. https://github.com/wiki-ai/ores/pull/236 [17:16:18] 10Scoring-platform-team, 10ORES: ORES service should return a readable error when MW API read times out - https://phabricator.wikimedia.org/T179117#3793252 (10awight) [17:16:28] 10Scoring-platform-team, 10ORES, 10Services (watching): ORES internal server error for edit with many added links - https://phabricator.wikimedia.org/T179064#3793253 (10awight) [17:17:14] wiki-ai/ores#850 (features_task_id - e209d40 : Adam Roses Wight): The build passed. https://travis-ci.org/wiki-ai/ores/builds/308541597 [17:18:32] 10Scoring-platform-team (Current), 10ORES, 10Patch-For-Review, 10Wikimedia-log-errors: Notice: Undefined property: stdClass::$ores_damaging_threshold in /srv/mediawiki/php-1.31.0-wmf.6/extensions/ORES/includes/Hooks.php on line 602 - https://phabricator.wikimedia.org/T179830#3793259 (10awight) a:03awight [17:20:45] 10Scoring-platform-team, 10ORES, 10Services (watching): ORES internal server error for edit with many added links - https://phabricator.wikimedia.org/T179064#3793273 (10awight) This was a glitch caused by an API mismatch between the Celery 4 library and Celery 3 code. [17:20:56] 10Scoring-platform-team, 10ORES, 10Services (watching): ORES internal server error for edit with many added links - https://phabricator.wikimedia.org/T179064#3793276 (10awight) 05Open>03Resolved a:03awight [17:21:27] 10Scoring-platform-team (Current), 10ORES, 10Services (watching): ORES internal server error for edit with many added links - https://phabricator.wikimedia.org/T179064#3712140 (10awight) [17:21:40] 10Scoring-platform-team (Current), 10MediaWiki-Recent-changes, 10MediaWiki-Watchlist, 10ORES, and 4 others: Watchlist and RecentChanges failure due to ORES on frwiki and ruwiki - https://phabricator.wikimedia.org/T181006#3793285 (10awight) [17:21:48] 10Scoring-platform-team (Current), 10Scap: Deployment fails with something gitty - https://phabricator.wikimedia.org/T179987#3793287 (10awight) [17:22:05] 10Scoring-platform-team (Current), 10Community-Liaisons, 10JADE: Set up working group for JADE - https://phabricator.wikimedia.org/T170954#3793289 (10awight) [17:22:15] 10Scoring-platform-team (Current): Bug: ORES thresholds API fails - https://phabricator.wikimedia.org/T176830#3793291 (10awight) [17:22:27] 10Scoring-platform-team (Current), 10Wikidata, 10Patch-For-Review, 10User-Ladsgroup, 10Wikidata-Sprint-2017-11-22: ORES thresholds for Wikidata is too strict - https://phabricator.wikimedia.org/T180450#3793293 (10Sjoerddebruin) 05Open>03Resolved Seems better now. [17:23:57] 10Scoring-platform-team (Current), 10ORES, 10Documentation, 10User-Zppix: Announce the ORES FAQ - https://phabricator.wikimedia.org/T181103#3793316 (10Halfak) [17:25:12] Amir1: If this looks good to you, I’ll deploy to beta: https://gerrit.wikimedia.org/r/#/c/393801/ [17:27:05] +1 ^ [17:27:38] kk [17:28:18] awight, at some point in the near future, I'd like to have you continue work towards deploying on the new ores* cluster. 
Maybe we can plan for that once you've addressed the ORES Ext apocalypse fallout. [17:28:35] *we have addressed ;) [17:28:53] Hmm... Maybe I could continue where you left off today. [17:31:05] halfak: Remind me where to !log for beta changes? -releng? [17:31:13] right [17:31:22] Yes [17:31:32] ty [17:31:54] * awight makes a note to document that [17:32:14] mmm that wasn’t quite right [17:32:17] That logs to https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [17:32:26] Im shocked they dont have stashbot for you in here [17:32:30] awight: !log beta [17:32:38] Iirc [17:32:59] But check this out: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep [17:33:02] ah lemme try that [17:33:21] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:37] that wasn’t it either. [17:33:46] ?? [17:34:00] Idk [17:34:04] I think [17:34:10] Grafana looks okay [17:34:11] Beta uses relengs [17:34:21] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 9.546 second response time [17:34:37] ? [17:34:49] Awight the socket timeout doesnt mean really something is wrong all the time it just means grafana and icinga had a derp with the check [17:34:54] madhuvishy: I see that you know how to log to the beta cluster SAL… Would you mind teaching me? [17:35:55] oh! We've been logging to releng's SAL historically. [17:36:13] harrum [17:36:18] woops [17:36:35] Lmao i can fix relengs sal for you awigh [17:36:37] I donno even know. Only that the beta cluster page shows a mysterious SAL that seems unrelated. [17:36:39] awight: * [17:36:50] Zppix: It would be helpful to point me to the config! [17:36:59] * awight learns to phish [17:37:06] I think you just use relengs sal [17:37:14] baahaaha [17:37:22] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:23] We’ll flood them with love [17:38:18] awight: i fixed sal so theres not dupe logs :P [17:38:22] OK. I don't know this "worker production" error. [17:38:52] halfak: i think its just icinga derping if theres not any noticable problem [17:39:01] I know we get those errors alot from icinga2-wm sometimes [17:39:03] halfak: Nothing in Grafana [17:39:20] And !log deployment-prep should also work in cloud? Since it's a project [17:39:49] * halfak edits docs to add "deployment-prop" [17:41:28] 10Scoring-platform-team (Current), 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: Write reports about why Ext:ORES is helping cause server 500s and alternatives to fix - https://phabricator.wikimedia.org/T181010#3793388 (10awight) [17:41:32] 10Scoring-platform-team (Current), 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: ORES beta cluster config should be as close to production as possible - https://phabricator.wikimedia.org/T181187#3782311 (10awight) 05Open>03Resolved [17:41:38] 10Scoring-platform-team (Current), 10ORES, 10Patch-For-Review: Deploy ORES (revscoring 2.0) - https://phabricator.wikimedia.org/T175180#3793390 (10awight) [17:41:40] 10Scoring-platform-team (Current), 10ORES, 10MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), 10Patch-For-Review: Rewind revscoring 1/2 compatibility hacks - https://phabricator.wikimedia.org/T179602#3793389 (10awight) 05Open>03Resolved [17:41:56] madhuvishy: Thanks! 
[17:42:02] https://wikitech.wikimedia.org/w/index.php?title=ORES/Deployment&diff=1776771&oldid=1776769 [17:44:22] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 0.537 second response time [17:44:51] We need to know what this means ^ [17:45:13] I can't get ORES to fail to generate a score. [17:45:29] Paladox would you be able to answer ^ [17:45:43] Yass https://www.mediawiki.org/w/index.php?title=Beta_Cluster&type=revision&diff=2631039&oldid=2583867 [17:45:57] https://grafana.wikimedia.org/dashboard/db/ores?panelId=2&fullscreen&orgId=1&from=now-1h&to=now-1m [17:46:03] Grafana does not look OK [17:46:04] halfak: I think Zppix is right, that labs is having some trouble connecting to production ORES [17:46:06] oh [17:46:12] eqiad is having problems. [17:46:19] Those are errors :) [17:46:23] Great... [17:46:38] * awight consults logstash [17:46:46] scb1001 and scb1002 [17:47:14] Zppix ? [17:47:56] CPU is at 100% [17:48:10] paladox: i think we got it nvm [17:48:15] halfak: Overload errors is also up [17:48:17] ok [17:48:20] 1.8GB of memory free on scb1001 [17:48:27] awight, that's because we have two workers down [17:48:34] halfak: scb may need restart or services on it need to [17:49:07] halfak: You think I should go kick the servers? [17:49:14] awight, no idea what this is. [17:49:18] Trying to figure it out. [17:49:21] We can autopsy later... [17:49:35] awight, ping -operations maybe? [17:49:46] akosiaris, ^ [17:49:53] +1 for fix now. [17:50:10] Why is varnish not failing over to codfw? [17:50:13] halfak: I don’t see any memory issues [17:50:24] cache hit rate is terrible [17:50:29] only ores* machines have had low memory [17:50:41] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:59] There was no traffic spike FWIW [17:50:59] We're getting a massive increase in score requests [17:51:04] Yes there was [17:51:16] https://grafana.wikimedia.org/dashboard/db/ores?panelId=1&fullscreen&orgId=1&from=now-3h&to=now-1m [17:51:17] wat—can you show me the graph? [17:51:32] Started right before the problems started [17:51:32] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 2.530 second response time [17:51:36] ty for the time scale [17:52:00] codfw should be taking on the capacity [17:52:04] hey, what's up ? [17:52:24] ORES is having issues. We're getting a spike in traffic and eqiad is struggling. [17:52:31] Trying to think of a short term solution [17:52:41] My sense is that redirecting traffic to codfw would work [17:52:49] codfw has more capacity and is currently unused. 
[17:55:30] akosiaris: seen on scb1001: OSError: write error [17:56:18] halfak also ^ [17:56:54] halfak: wat I think it’s just this, [pid: 27780] 10.64.32.97 (-) {46 vars in 916 bytes} [Tue Nov 28 17:55:00 2017] GET /scores/nlwiki/damaging/?model_info=test_stats&format=json => generated 106 bytes in 1 msecs (HTTP/1.1 400) 6 headers in 217 bytes (1 switches on core 0) user agent "MediaWiki/1.31.0-wmf.8" [17:57:23] 10Scoring-platform-team (Current), 10drafttopic-modeling: Build mid-level WikiProject category training set - https://phabricator.wikimedia.org/T172321#3793439 (10Sumit) We now have a dataset at figshare - https://doi.org/10.6084/m9.figshare.5640526.v1 \o/ [17:57:39] awight, that request doesn't use celery and it returns really fast [17:57:41] I spoke too early—there was a huge burst of those, but that’s not the only type of request hitting scb1001 [17:58:37] 10Scoring-platform-team: Why is nlwiki requesting the old-style thresholds API? - https://phabricator.wikimedia.org/T181534#3793451 (10awight) [17:58:51] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:59:51] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 834 bytes in 8.543 second response time [18:00:30] 10Scoring-platform-team, 10Icinga, 10ORES, 10monitoring: ORES worker icinga message not specific enough - https://phabricator.wikimedia.org/T181536#3793489 (10Halfak) [18:01:30] 10Scoring-platform-team, 10Operations: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793518 (10awight) [18:03:18] 10Scoring-platform-team, 10Operations: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793540 (10awight) [18:07:08] halfak: well, looks like the phantom OOM is undead. Should we reduce # of workers? [18:12:09] awight, not sure we should do anything [18:12:16] I'm not sure why we aren't just switching to codfw [18:12:25] But I'll follow akosiaris' lead. [18:12:38] I can ask questions later :) [18:13:21] halfak: Well you convinced me that I should take another look at the new cluster, sitting idle by the switchyard. [18:13:43] Decoupling the revscoring 2 and celery changes should have gotten us out of the weeds on that. [18:13:59] I’ll also drop worker count on those like we were discussing a while ago. [18:14:51] +1 awight [18:16:25] halfak: We should have a private graph with top user agents and IPs [18:17:29] awight, agreed [18:17:58] awight: i think that would require a seperate grafana install [18:21:10] 10Scoring-platform-team, 10monitoring: Monitoring for top IPs and User-Agents hitting the ORES service - https://phabricator.wikimedia.org/T181542#3793626 (10awight) [18:21:15] Zppix: good point. [18:22:25] halfak: I’m wondering why we would be getting timeout errors if there’s still capacity on two of our boxes. [18:22:42] Looks like varnish isn't failing over [18:22:50] Either to the two good boxes or to codfw. [18:22:55] Is varnish even load balancing? [18:22:55] I guess scb[1-2] is grabbing jobs and not finishing them [18:22:57] huh [18:22:58] I thought it would do so automatically. [18:26:23] 10Scoring-platform-team, 10Operations: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793657 (10awight) We saw very different available memory levels using `top` directly on scb100[1-2], vs. the ORES Grafana dashboard which never showed a dip below c. 20GB. This needs to be fi... 
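On the "top user agents and IPs" graph wished for above (T181542), a rough offline approximation is to tally uwsgi request-log lines like the `model_info=test_stats` sample quoted earlier. The log format, and therefore the regex, is an assumption inferred from that single sample line.
```
import re
from collections import Counter

# Toy parser for uwsgi request-log lines like the sample quoted above, e.g.
#   [pid: 27780] 10.64.32.97 (-) {46 vars in 916 bytes} [...] GET /scores/... \
#     => generated 106 bytes ... user agent "MediaWiki/1.31.0-wmf.8"
# The exact format is an assumption based on that one sample.
LINE_RE = re.compile(r'\]\s+(?P<ip>\d+\.\d+\.\d+\.\d+)\s+.*user agent "(?P<ua>[^"]*)"')

def top_clients(log_lines, n=10):
    """Count the most frequent client IPs and user agents in an access log."""
    ips, uas = Counter(), Counter()
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            ips[m.group("ip")] += 1
            uas[m.group("ua")] += 1
    return ips.most_common(n), uas.most_common(n)
```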
[18:31:43] (03PS1) 10Awight: Remove unprovisioned servers [services/ores/deploy] - 10https://gerrit.wikimedia.org/r/393822 [18:34:56] 10Scoring-platform-team, 10Operations, 10Patch-For-Review: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793518 (10Halfak) Strange memory behavior on scb1001 and scb1002 for the last week: https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-da... [18:35:29] 10Scoring-platform-team, 10Operations, 10Patch-For-Review: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793712 (10Halfak) scb1001 and scb1002 had OOMs show up. ``` [12:04:04] now on scb1001 OOM showed up ... [12:05:32] and scb1002 [12:05:41] 10Scoring-platform-team, 10Operations, 10Patch-For-Review: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793728 (10Halfak) We had a sudden increase in requests/min for ORES around 1600 UTC. But we've seen bigger spikes that did now cause timeouts or memory issues around 130... [18:38:45] halfak: fyi, just fixed the timeout moving averages in Grafana. [18:39:07] I'm making notes in https://phabricator.wikimedia.org/T181538 [18:39:10] Anything I missed? [18:39:44] LGTM! [18:39:54] 10Scoring-platform-team, 10Operations, 10Patch-For-Review: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793732 (10Halfak) We've failed over to CODFW. [18:42:17] awight, what do you think is going on here? https://grafana.wikimedia.org/dashboard/db/ores?panelId=14&fullscreen&orgId=1&from=1511872559429&to=1511894099429 [18:42:27] Issues start at ~1600 UTC [18:42:32] 10Scoring-platform-team, 10Operations, 10monitoring: Investigate scb1001 and scb1002 available memory graphs in Grafana - https://phabricator.wikimedia.org/T181544#3793736 (10awight) [18:43:30] halfak: Perhaps most weirdly, those show memory usage *dropping* on scb100[1-2] [18:43:42] and increasing on the healthy machines. [18:43:43] awight, could be random OOM kills? [18:44:01] There’s no spike leading up to it though. But yeah maybe,. [18:44:03] awight, that's what I'm seeing too. [18:44:17] Zooming in, definitely no spike. [18:44:17] Is there a graph of OOM? [18:44:21] halfak I am currently at a kubernetes meetup, I 'll be on and off, but laptop will be turned on and I 'll be taking a look. Please ping me directly if there is something urgent [18:44:41] oom are not graphed, nor can they be btw [18:44:55] akosiaris, right. Just want to put them on a time scale :( [18:44:56] it's the kernel showing up and killing a process due to extreme memory pressure [18:45:08] I can give you times looking at the logs [18:45:18] akosiaris, thanks for your time here. [18:45:21] Oh that would be helpful. [18:45:32] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793766 (10greg) [18:45:55] I'm trying to figure out if the increase in traffic instigated this or something was up beforehand. [18:47:04] A whole lot of fork(): Cannot allocate memory [core/master_utils.c line 724] [18:47:22] Why at INFO level... [18:47:31] that should be at WTF level [18:47:39] if you are looking at logstash [18:47:44] it's because it's hardcoded [18:47:52] at the uwsgi configuration [18:47:58] ooh k [18:48:00] we should fix that [18:48:10] application should be setting the level, not uwsgi [18:49:39] awight, please note the timestamps for that in the task. 
:) [18:49:42] 10Scoring-platform-team, 10Operations: Let the ORES application set log severity, not uWSGI - https://phabricator.wikimedia.org/T181546#3793772 (10awight) [18:49:56] halfak: cool [18:50:11] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793785 (10Halfak) From https://grafana.wikimedia.org/dashboard/db/ores?panelId=14&fullscreen&orgId=1&from=1511872559429&to=1511894099429, we can s... [18:50:18] Damn. I wanted to eat lunch. [18:50:23] I'm going to run away for a few minutes. [18:53:20] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793822 (10awight) Looks like we've been hitting memory limits for quite a while, at least since Oct 26th: https://logstash.wikimedia.org/goto/4e64... [18:53:34] halfak: /o [19:00:11] awight, looks like the answer is fewer workers after all. [19:00:28] And, GTFO off those shared boxes. [19:00:33] I wonder who increased their memory usage on the machine :/ We dropped ours a little bit [19:00:35] Right [19:00:35] that [19:00:38] so much that. [19:02:17] Unfortunately, I’m still blocked from even deploying to our machines. [19:04:21] #freedomforores2k17 xD [19:04:37] awight, wait... why is that? [19:05:05] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10User-Joe, 10User-Ladsgroup: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3793896 (10awight) [19:05:21] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team, and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3793896 (10awight) [19:05:25] halfak: e.g. ^ [19:05:29] right [19:05:40] The deployment to those machines has been real glitchy. [19:06:24] on that note, can you CR this easy one? The machines aren’t ready. https://gerrit.wikimedia.org/r/#/c/393822/ [19:07:00] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team, and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3793896 (10awight) [19:09:51] 10Scoring-platform-team (Current), 10ORES, 10Operations, 10Release-Engineering-Team, and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3793924 (10awight) This could be another submodule rewriting problem. I see that .gitmodules has been modified on the ta... [19:10:21] * awight looks for a drink of WTF [19:16:31] awight: did you fetch this in tin? https://gerrit.wikimedia.org/r/#/c/393801/ [19:17:01] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:08] OH fuck [19:17:10] Amir1: No, good point. I let it hit labs but haven’t been polite enough to merge [19:17:15] halfak: Wheee [19:17:44] What is this, the first day of the goldang fundraiser? [19:17:52] awight: I know but you need to fetch it in tin in prod as well otherwise we will have angry RelEng [19:17:58] because the repo is the same [19:18:06] Amir1: totally, ty for pointing that out! [19:18:16] awight: let me do it [19:18:35] legoktm: hey, when do you have some time to talk about the extension? [19:18:45] halfak: Well that’s every cylinder eh [19:19:10] Everything is failing in CODFW!? 
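On the "application should be setting the level, not uwsgi" point above (T181546), a minimal sketch of an application configuring its own Python log severity. Note that uwsgi's own internal messages, such as the fork() error quoted above, are emitted by uwsgi itself and are configured separately; the logger names below are placeholders, not the actual ORES logging setup.
```
import logging

def configure_logging(verbose=False):
    """Let the application, not the WSGI container, decide log severity."""
    level = logging.DEBUG if verbose else logging.WARNING
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    # Example: keep a chatty third-party logger quieter than our own code.
    logging.getLogger("urllib3").setLevel(logging.WARNING)

configure_logging(verbose=False)
logging.getLogger("ores.example").warning("log level is now controlled by the application")
```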
[19:19:11] WTF [19:20:29] Can't ssh to scb2001 [19:21:52] Nobody in… operations [19:21:55] whoa [19:22:50] everything is going to shit it seems not just ores [19:23:00] wmf is having a major outage [19:23:07] I'm happy it' [19:23:09] s not us [19:23:13] but generally unhappy [19:23:33] Was just on a call with DarTar. "I need to go. I think codfw just disappeared." [19:24:42] fundraisers are going to cry. [19:26:01] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 899 bytes in 5.655 second response time [19:27:00] halfak: Where are the massive number of test_stats requests coming from? UA is MediaWiki which don’t make any damn sense. [19:27:15] Amir1, ^ [19:28:20] halfak: weird [19:28:37] did we change anything in mediawiki? I don't think so [19:28:50] also we can put these stuff behind varnish [19:29:05] Hey. php-1.31.0-wmf.8/extensions/ORES/includes/StatsV1.php [19:29:29] That deployed code is older than I thought [19:29:34] Probably cos we broke the train. [19:30:21] Oooh—OH [19:30:30] When the new thresholds code fails, it falls back to old thresholds code [19:30:51] And that seems to be unthrottled when the cached thresholds are empty? [19:32:46] halfak: fwiw I get responses from https://ores.wikimedia.org/v3/scores/enwiki/123456?features= [19:33:01] Also cached responses without the ?features= [19:35:30] Let’s bump T175654 into the current board perhaps. [19:35:30] T175654: ORES service gets stuck reporting "server overloaded" even after load returns to normal - https://phabricator.wikimedia.org/T175654 [19:37:47] awight, focused on current issue. [19:38:13] halfak: Of course. This came to mind because I think we handle overload ungracefully. [19:39:07] awight, any hypothesis about what just happened? [19:39:15] And why it happened to 6 machines all at once? [19:39:20] Both problem and recovery [19:39:22] None at all, I’m reading the logs [19:39:25] kk [19:39:55] My only thought is that when we get into an overload condition, our ice palace melts in the hot rain [19:40:23] I don’t even have a mental model of what *should* happen in that case. [19:41:33] awight, to me, it looks like celery just stopped taking on jobs. [19:41:38] It doesn't look like overload. [19:41:54] Or rather, it's not overload from too many requests building up. [19:42:05] that’s believable [19:42:08] https://grafana.wikimedia.org/dashboard/db/ores?panelId=5&fullscreen&orgId=1&from=now-1h&to=now-1m [19:42:21] CPU usage just goes whacko all of a sudden. [19:42:38] Grr, did I mention that I’d rather have access to logfiles than logstash... [19:43:23] I’m seeing thousands of the test_stats requests. [19:43:30] I think we have to plug that hole. [19:44:26] +1 [19:44:35] It’s a chain-reaction, when a wiki’s thresholds cache expires, if the request fails, we fallback to old thresholds. Which now returns a 400, and the old code doesn’t cache the error. Instead, every S:RC page requests test_stats [19:44:59] Chad is about to deploy wmf.10 which has removed the “v1” thresholds code. [19:46:49] No I’m wrong. The train won’t get to WP until Thursday. [19:47:37] Amir1: halfak: The Services window is already moved? [19:47:56] ? [19:48:17] Tricky—I see. It’s at a different time on MW train days. 
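The spot check described above (hitting the public v3 scores endpoint with and without `?features=`) looks roughly like this; a sketch, with the URL and example revision ID taken from the conversation.
```
import requests

# Endpoint quoted above; 123456 is just the example revision ID used there.
BASE = "https://ores.wikimedia.org/v3/scores/enwiki/123456"

for url in (BASE, BASE + "?features="):
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code)
    print(resp.json())  # score payload structure per the v3 API
```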
[19:48:47] awight: One thing is that the window moves throughout the week [19:49:04] but stays the same during different weeks so Mondays is always one time [19:49:11] kk [19:49:11] but it might differ from Tuesday [19:50:06] I want to emergency-deploy Ext:ORES but it’s quite a bit different than wmf/1.31.0-wmf.8 [19:50:49] This is the only thing I care about though, + 21ad7eefe3107a6a213987c043116d1f86dfe557 Revert "Fallback to old thresholds API as necessary (take 2)" [19:51:15] Or—can anyone think of a safer way to do this? [19:51:24] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794152 (10Halfak) All nodes in CODFW just went down at the same time. For a short period. See * [Overload errors](https://grafana.wikimedia.org... [19:51:47] Goal: stop the flood of > GET /scores/enwiki/damaging/?model_info=test_stats&format=json => generated 106 bytes in 1 msecs (HTTP/1.1 400) [19:51:50] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794155 (10Halfak) My hypothesis is that our redis nodes had a network blip and that caused all scoring requests to back up for a period. [19:52:04] \o/ Single points of failure! [19:52:49] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: Investigate redis-cluster or other techniques for making Redis not a single point of failure. - https://phabricator.wikimedia.org/T181559#3794157 (10awight) [19:56:25] 10Scoring-platform-team (Current), 10ORES, 10Patch-For-Review: Deploy ORES (revscoring 2.0) - https://phabricator.wikimedia.org/T175180#3794183 (10awight) [19:56:29] 10Scoring-platform-team (Current), 10ORES, 10MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), 10Patch-For-Review: Rewind revscoring 1/2 compatibility hacks - https://phabricator.wikimedia.org/T179602#3730423 (10awight) 05Resolved>03Open Noticed that this code isn't deployed yet. It's act... [19:56:56] 10Scoring-platform-team (Current), 10ORES, 10MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), 10Patch-For-Review: Rewind revscoring 1/2 compatibility hacks - https://phabricator.wikimedia.org/T179602#3794184 (10awight) We could cherry-pick into wmf.8, or wait for the train to pick this up on... [20:11:09] halfak: That oresrdb network graph makes it look like the service was spiking like crazy, and when it overloaded nothing bothered to check in with Redis any more. [20:11:24] The bad stuff all happens before the gap in network traffic. [20:12:06] But there was no spike in traffic -- just in redis network. [20:12:44] There was a spike in scores errored [20:12:54] 19:15-19:20 [20:13:02] Oh sure. During the network drop [20:13:05] Also timeout errors [20:13:06] But not before [20:13:12] no, the network drops 19:20-19:25 no? [20:13:22] yeah. [20:13:33] Oh good point [20:13:47] Could those spikes could be the 400 crap from the faulty fallback code? [20:14:10] Do threshold requests go through the same Celery paths? [20:14:24] And web workers spikes at 19:14 [20:14:36] threshold requests do not touch celery at all. [20:14:45] They can respond even when celery is down. [20:14:54] That’s hurting my theory. 
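A schematic sketch of the chain reaction awight describes above (not the actual extension code, which is PHP): because the old fallback path never caches a failed thresholds lookup, every Special:RecentChanges pageview re-issues the failing request, turning one expired cache entry into a flood of `model_info=test_stats` 400s.
```
# Schematic illustration of the failure mode described above, not real extension code.
cache = {}

def fetch_thresholds(wiki, model):
    """Stand-in for the HTTP call that is currently returning 400s."""
    raise RuntimeError("ORES returned HTTP 400 for model_info=test_stats")

def thresholds_for_recentchanges(wiki, model):
    key = (wiki, model)
    if key in cache:
        return cache[key]
    try:
        value = fetch_thresholds(wiki, model)
        cache[key] = value          # only successes are cached...
        return value
    except RuntimeError:
        return None                 # ...so every pageview retries the failed request

# Every Special:RecentChanges render repeats the failing call: at hundreds of
# pageviews per second, the backend sees hundreds of thresholds requests per second.
for _ in range(5):
    thresholds_for_recentchanges("enwiki", "damaging")
```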
[20:15:07] * halfak double-checks [20:15:41] * halfak starts up ores_wsgi.py but not ores_celery.py on localhost [20:16:11] halfak: Check this out: https://logstash.wikimedia.org/goto/36c5fdfa3b3bf983b3a16c2acd91ccbe [20:16:20] awight, confirmed [20:16:47] It’s an avalanche of test_stats requests, about 400/sec [20:16:51] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:54] OH! [20:17:03] * awight pretends to stay calm [20:17:04] It's filling up the uwsgi worker pool maybe [20:17:08] F---- [20:17:32] Chad will fucking die if we ask to emergency deploy this fix during his train window... [20:17:37] But maybe immediately afterwards. [20:17:51] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 883 bytes in 8.079 second response time [20:18:05] Only thing is, I want to be certain about what will happen with the *new* code if it wants thresholds and the server is down. [20:18:13] It might be no better. [20:18:22] * awight tries in vagrant [20:18:45] My stress level is through the roof today. [20:18:49] I'm going to take a short walk [20:18:51] back in a bit. [20:18:57] please do [20:18:59] see ya! [20:20:52] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:06] OK couldn't actually walk away for more than a minute. [20:21:22] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794219 (10awight) @Dzahn found a clue to the latest *codfw* outage, in which oresrdb Redis network traffic spikes and then crashes to zero: ``` [3... [20:22:49] I think we have to get off codfw, right? [20:23:31] it was just OOMs on the two machines in eqiad, and akosiaris already set LB weights so we’re favoring the good legs. [20:23:38] I think it's worth a shot. [20:24:26] mutante’s on it. [20:24:51] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 898 bytes in 1.111 second response time [20:25:28] Well this is fun. [20:27:35] Right. 8| [20:30:12] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3794273 (10Halfak) [20:34:22] 10Scoring-platform-team (Current), 10drafttopic-modeling: Build mid-level WikiProject category training set - https://phabricator.wikimedia.org/T172321#3494777 (10Halfak) Great work! [20:38:58] * halfak crawls into his cave to work on JADE for a little while [20:39:08] No more things breaking today [20:39:12] OK? [20:39:18] +1 I have the same instinct [20:57:51] * awight plays in mw-vagrant sandbox where nobody can hear me scream [21:02:23] I'm off for the day [21:02:26] see you around [21:02:27] o/ [21:02:28] o/! [21:14:51] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:00] halfak: awight: so I wee failed back from codfw once more [21:15:08] and we still are having problems, right ? [21:15:15] at least according to ^ [21:15:41] It looks like same bug that started after failover... [21:20:02] we mean that redis stops responding or something ? [21:20:07] I am still reading backlog [21:21:15] akosiaris: In the earlier network traffic graph, it had a few odd spikes, then cut out for a few minutes. [21:21:53] It seems to be a different problem than we saw on eqiad. 
[21:22:15] And redis was the best explanation since all workers stopped serving requests at the same time. [21:22:33] well the current alert is about eqiad, right ? [21:22:45] nothing to do with codfw [21:22:51] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 834 bytes in 5.026 second response time [21:22:57] so we are suffering in eqiad as well [21:23:02] * halfak looks [21:23:27] Confirmed. Looks like it did in codfw. [21:23:57] For some reason, scb1002 is serving traffic. [21:24:03] https://grafana.wikimedia.org/dashboard/db/ores?panelId=3&fullscreen&orgId=1 [21:24:05] yeah it's pooled [21:24:26] both are actually, just at a lower weight than scb1003, scb1004 [21:24:32] Oh I see. [21:24:36] and with workers stopped [21:24:49] not like uwsgi has problems [21:25:18] that's one thing I would like to split btw when we move to dedicated cluster [21:25:32] have different hosts do the celery dance and different ones do the uwsgi dance [21:25:35] * halfak turns on his cortisol again [21:25:40] That pybal thing is only changing where the web requests land, right? [21:25:41] so that they don't step on each other's feet [21:25:45] yes [21:25:51] has nothing to do with celery [21:25:56] +1 We were thinking of leaving uwsgi on scb and moving all celery to ores* [21:26:01] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:26:03] OK. Celery then distributes the work evenly across machines. [21:26:38] [[:en:Cortisol]] [21:26:39] 10[1] 04https://meta.wikimedia.org/wiki/:en:Cortisol [21:27:24] I was looking at https://github.com/antirez/redis/issues/1019 but it has been "fixed" since 2.6 [21:27:25] we run 2.8 [21:28:30] btw that issue only seems to hit the queue redis [21:28:32] not the cache redis [21:30:00] I can disable the persistency [21:30:07] 10Scoring-platform-team (Current), 10ORES: ORES 500 errors on a threshold lookup request - https://phabricator.wikimedia.org/T179711#3733823 (10Framawiki) {T181006} is closed, is this task done too ? [21:30:09] that should make redis keep the queue in cache [21:30:23] and avoid the fsync issue [21:30:30] but if redis dies, we loose the queue [21:30:33] lose* [21:30:38] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3794457 (10Halfak) ``` [15:27:23] I was looking at https://github.com/antirez/redis/issues/1019 but it has been "fixed" s... [21:30:43] halfak: how bad would that be ? [21:31:04] I am guessing not too bad, but I 'd like a confirmation [21:31:19] 10Scoring-platform-team (Current), 10ORES: ORES 500 errors on a threshold lookup request - https://phabricator.wikimedia.org/T179711#3794458 (10Halfak) It is. [21:31:26] 10Scoring-platform-team (Current), 10ORES: ORES 500 errors on a threshold lookup request - https://phabricator.wikimedia.org/T179711#3794459 (10Halfak) 05Open>03Resolved a:03Halfak [21:31:28] 10Scoring-platform-team (Current), 10ORES: Deploy ORES early Nov 2017 - https://phabricator.wikimedia.org/T179837#3794461 (10Halfak) [21:31:52] akosiaris, just to confirm, we'd lose the queue with a restart? [21:31:57] yes [21:32:05] That would actually be desirable :) [21:32:08] lol [21:32:11] really ? [21:32:14] ok then [21:32:16] haha [21:32:31] Usually when things are so bad that we reboot, we don't want any of that old crap clogging things up. 
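What "disable the persistency" amounts to, sketched with redis-py (connection details are placeholders, and in production this was presumably applied through the Redis configuration rather than a client call): turn off AOF on the queue instance only, trading the fsync stalls for losing the in-flight queue if Redis restarts.
```
import redis

# Placeholder connection; point this at the *queue* Redis instance only,
# not the score cache (per the "only do it on the queue" caveat below).
queue_redis = redis.StrictRedis(host="oresrdb-queue.example", port=6379)

print(queue_redis.config_get("appendonly"))   # 'yes' while AOF persistence is on
print(queue_redis.config_get("appendfsync"))  # fsync policy behind the AOF latency warnings

# Disable AOF so the queue lives only in memory: no more
# "Asynchronous AOF fsync is taking too long", but a restart drops the queue,
# and everything talking to Celery would then need a restart to recover cleanly.
queue_redis.config_set("appendonly", "no")
```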
[21:32:52] I can't imagine a hot reboot of redis where we'd want to keep the queue [21:32:53] halfak: It might hose the state of Celery though [21:33:07] i.e. we would also need to restart all the celery managers [21:33:16] If it drops everything for that redis instance, celery would recover. [21:33:30] Oh right. Yeah. Everything that touches celery would need a reboot. [21:33:38] But I don't think the alternative is better. [21:33:54] +1 corrupt or locked Redis anyway [21:33:59] Not reboot. restart [21:34:01] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 1.619 second response time [21:34:47] I'm worried that we won't find a solution that sticks until we figure out why this suddenly happening across datacenters. [21:34:52] halfak: So the bad news about T179602 is that the new code has the same problem. [21:34:52] T179602: Rewind revscoring 1/2 compatibility hacks - https://phabricator.wikimedia.org/T179602 [21:35:03] ok I am gonna disable queue persistency for a while [21:35:12] and let's see what we get out of it [21:35:18] awight, Same massive requests for model_info=test_stats? [21:35:22] akosiaris, +1 [21:35:24] funny thing is that this is just a bottleneck right now... [21:35:29] well it’s for the new API but yes. [21:35:32] we could hit some other issue later [21:35:33] Make sure to only do it on the queue [21:35:43] and only in eqiad for now [21:35:53] akosiaris, right [21:36:00] halfak: We’ll have a cache stampede on the new-style thresholds endpoint. [21:36:26] ahh. awight, any way we could just address that right now? [21:36:29] Halfak are we sure its not our code? Why are we so sure its the servers? [21:36:33] And still get in the train? [21:37:02] Zppix, uh. That's a surprisingly complicated question. [21:38:12] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:38:16] 10Scoring-platform-team: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567#3794478 (10awight) [21:38:49] halfak: Yeah I can write a patch for ^. The train is Thursday, though. [21:39:10] If we’re feeling confident, we can do an emergency deploy of the extension... [21:40:21] Oh hey we have stats for this! [21:44:19] we got redis persistence for the queue stopped since 21:39 UTC btw [21:44:30] But the threshold fetching stats don’t show anything crazy: https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1 [21:44:33] That can’t be right [21:48:48] 10Scoring-platform-team: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567#3794508 (10awight) We aren't seeing these huge floods of failed thresholds requests in the metrics, because we're falling back to the old-style thresholds code which is missing the metrics... [21:53:21] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 833 bytes in 8.042 second response time [21:56:13] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.015 second response time [21:59:21] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 834 bytes in 1.077 second response time [21:59:22] halfak: Okay there’s a feasible fix. Working on the patch... [22:10:03] 10Scoring-platform-team: Why is nlwiki requesting the old-style thresholds API? 
- https://phabricator.wikimedia.org/T181534#3794565 (10awight) [22:10:05] 10Scoring-platform-team: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567#3794567 (10awight) [22:10:19] 10Scoring-platform-team: Why is nlwiki requesting the old-style thresholds API? - https://phabricator.wikimedia.org/T181534#3793451 (10Halfak) {F11056299} [22:10:28] 10Scoring-platform-team: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567#3794478 (10Halfak) {F11056299} [22:11:22] (03PS1) 10Awight: Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) [22:14:31] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.058 second response time [22:22:41] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 2.531 second response time [22:30:24] (03CR) 10Halfak: [C: 031] Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [22:30:48] (03CR) 10Awight: [C: 04-1] "Not behaving correctly during smoke testing..." [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [22:34:05] (03CR) 10Krinkle: Rate limit thresholds failures to once per (minute x model x wiki) (031 comment) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [22:36:42] (03CR) 10Krinkle: [C: 04-1] Rate limit thresholds failures to once per (minute x model x wiki) (031 comment) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [22:45:46] halfak: FWIW what’s happening now is that I can’t test the freaking patch ^ [22:45:53] All my vagrant thresholds work for naught [22:46:14] Nothing is getting called and I suppose I’m panicking trying to figure out the settings that must be preventing [22:48:49] S:Version shows the extension is installed, I’ve rolled back locally to old master... [22:48:56] nothing at all happening in Special:RC [22:50:00] No ores in the queries, nothing. [22:50:15] halfak: Do you have a working Extension:ORES? [22:50:23] Negative [22:50:44] Sorry. [22:50:51] kk [22:51:04] hmm. this is what vagrant is for. blowing it away and recreating. [22:59:23] 10Scoring-platform-team, 10Operations, 10Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3794768 (10akosiaris) eqiad redis queue for ORES is no longer persisting in disk since 21:39 UTC today in an effort to address this.... [22:59:45] halfak: I’ve been hitting RecentChangesLinked again, somehow. That was the entire problem. [23:00:00] Oh damn. I do that sometimes too. [23:07:12] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794775 (10Halfak) OK. Dominant hypothesis is that we are DOS-ing ourselves via the ORES Ext. When we accidentally broke the $wgOresBaseUrl, the... 
[23:08:31] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794779 (10Halfak) BTW, see T181567 where we initially describe the correlation between failed "test_stats" request from MediaWiki and the downtime... [23:08:41] 10Scoring-platform-team, 10Patch-For-Review: Rate limit thresholds requests when the service is down - https://phabricator.wikimedia.org/T181567#3794478 (10Halfak) [23:08:43] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794781 (10Halfak) [23:09:43] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3793518 (10Halfak) [23:09:45] 10Scoring-platform-team (Current), 10Operations, 10Wikimedia-Incident: Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200* - https://phabricator.wikimedia.org/T181563#3794783 (10Halfak) 05Open>03Resolved a:03Halfak This looks done to me. Feel free to re-open [23:13:11] (03PS2) 10Awight: Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) [23:14:31] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:40] (03CR) 10jerkins-bot: [V: 04-1] Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [23:17:15] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794850 (10Halfak) It looks like we have causal evidence that MW is hammering ORES into the ground. Now that $wgOresBaseUrl is fixed, ORES is aga... [23:17:47] BTW awight: https://phabricator.wikimedia.org/T181538#3794850 [23:18:20] Anything I can do to help with the Extension work? [23:19:11] (03PS3) 10Awight: Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) [23:19:45] halfak: Thanks 10^6 for fielding all the ops discussion while I have my head in the hot sand! [23:20:38] halfak: If tests pass on PS3, maybe you can CR. [23:20:48] I think Krinkle has signed off on the cache-specific stuff. [23:20:58] cool [23:21:02] ps3? [23:21:05] What do you think about the current situation? Seems like we shouldn’t leave it like this overnight? [23:21:07] patchset 3 :) [23:21:14] lol [23:21:15] I agree. [23:21:27] no I’ve been playing pirated mariokart this whole time :) [23:21:31] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 834 bytes in 0.521 second response time [23:21:36] lol [23:21:53] OK so I know what I want to do. [23:22:03] But we can't do that. [23:22:13] Get out of the software industry? [23:22:19] (Deploy your work now so we can keep ORES *and* stop hammering it) [23:22:31] Wait what can we not do? [23:22:39] Deploy your Ext change [23:22:41] Or can we? [23:22:55] Ah. I’m hoping. Asked greg-g for a window... [23:23:25] It’s only slightly less horribly sketchy than either mass-disabling across all wikis, or leaving a steaming mess all night. [23:24:05] I agree. 
[23:24:13] K we have V+2 on that thing, do your worst :) [23:24:22] We can torment beta for a bit. [23:27:40] awight, null is [] in PHP? [23:27:57] halfak: no, null turned out not to work. [] is just an empty array. [23:28:03] I think null was too false-ish [23:28:07] Gotcha. [23:28:11] lol PHP [23:28:16] CURSES [23:28:21] THough suppose Null/None is often Falseish [23:28:43] At least the situation isn’t as bad as MySQL null-ish [23:29:39] So, I'm looking at the change you made in extractBoundValue and not understanding. [23:29:40] is https://gerrit.wikimedia.org/r/#/c/393922/ supposed to stop the stampedes? [23:30:27] legoktm, that's the idea [23:30:28] legoktm: yes [23:30:54] halfak: That change is because I discovered that we’re not handling another type of “bad” response, in this case the cached [] [23:31:06] Gotcha. [23:31:16] 10Scoring-platform-team, 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3794913 (10akosiaris) Adding OOM kernel logs per host for posterity's sake. Feel free to ignore `electron` in the logs. It's already being memory... [23:31:25] Returning null now means “I don’t know wtf to do with this threshold, let’s skip it" [23:32:54] where does the actual waiting happen? [23:33:07] halfak: thanks for actually reviewing, BTW! [23:33:22] The waiting happens in the “catch” branch of the getWaitSetCallback callback [23:33:27] Totally :) I figure my dumb questions might help you catch something :D [23:33:35] We set &$ttl to one minute in case of a failure [23:35:12] what handles that ttl? [23:35:27] job queue? [23:35:43] That’s handled by WANObjectCache [23:35:47] https://doc.wikimedia.org/mediawiki-core/master/php/classWANObjectCache.html#a83f206cc616ee6492941da9e0a2db2f0 [23:35:50] https://doc.wikimedia.org/mediawiki-core/master/php/WANObjectCache_8php_source.html [23:36:22] Gotcha. OK> LGTM. I'ma +1 [23:36:40] (03CR) 10Halfak: [C: 031] Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [23:36:42] pls +2 [23:36:47] Or I can self-merge [23:36:47] (03CR) 10Halfak: [C: 032] Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [23:36:51] :D [23:36:59] halfak: awight: I don't think it will work as well as it could [23:37:08] WANObjectCache has a LOCK feature that should be used [23:37:14] or really we should introduce poolcounter here [23:37:25] legoktm, tonight or tomorrow? [23:37:37] I'm down for tonight if you think that's necessary [23:37:41] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 871 bytes in 0.018 second response time [23:37:48] * halfak isn't going anyway until the fires are put out. [23:37:54] well this patch isn't going to make this worse [23:38:02] legoktm: I was wondering about that—I’d also like to reuse the old value in case of failure [23:38:22] (03Merged) 10jenkins-bot: Rate limit thresholds failures to once per (minute x model x wiki) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393922 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [23:38:22] Wasn’t feeling up to experimenting with the guts of this arcane thing though :) [23:40:17] I’m poised to delete cache and kill the beta ores service. 
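The patch and review discussion above revolve around MediaWiki's WANObjectCache::getWithSetCallback (referred to earlier as "getWaitSetCallback"), where the callback can shorten &$ttl to one minute when the fetch fails, plus legoktm's suggestion to also use the cache's lock/lockTSE machinery (or PoolCounter) so only one process refills a missing key instead of 400 per second stampeding the service. Since the extension is PHP, what follows is only a language-agnostic Python sketch of that combined pattern under the assumption of a simple in-process cache and lock; it is not the MediaWiki API and the TTL values are placeholders.
```
import threading
import time

# Conceptual sketch of "cache failures briefly + only one refiller at a time".
# This is not WANObjectCache; it just mirrors the pattern discussed above.
_cache = {}            # key -> (expires_at, value)
_lock = threading.Lock()

FAILURE_TTL = 60       # cache a failed lookup for one minute, as in the patch
SUCCESS_TTL = 24 * 3600  # placeholder success TTL

def get_with_set_callback(key, fetch):
    now = time.time()
    entry = _cache.get(key)
    if entry and entry[0] > now:
        return entry[1]                     # fresh hit (even a cached failure)
    # Only one thread regenerates the value; others reuse the stale/absent entry
    # instead of stampeding the backend (roughly what lockTSE is for).
    if not _lock.acquire(blocking=False):
        return entry[1] if entry else None
    try:
        try:
            value, ttl = fetch(), SUCCESS_TTL
        except Exception:
            value, ttl = None, FAILURE_TTL  # negative-cache the failure
        _cache[key] = (time.time() + ttl, value)
        return value
    finally:
        _lock.release()
```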
[23:41:13] so probably every request to the ORES service should go through poolcounter to avoid a DoS [23:41:41] And then on top of poolcounter, there should be a cache which is responsible for figuring out how to handle the service being unavailable, potentially using stale data [23:42:03] Do you think you could submit a change quick that would do poolcounter -- or is it very involved? [23:42:34] * awight humbly wants to keep this change minimal [23:42:57] The change as it is now should keep us down to num_wikis / minute in the worst case. [23:43:16] If you agree legoktm, I'd like to move forward with awight's change. [23:43:29] Although there could be race conditions, which I guess is what the lockTSE would be for? [23:44:02] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 7.535 second response time [23:44:11] At 400 requests in a second, a race condition might be no freaking joke. [23:44:15] poolcounter is probably more work [23:44:32] I think the change is fine, but I wouldn't consider the problem fixed [23:44:33] So assuming it takes 1s to fetch and store the failure, that’s actually 400 x num_wikis requests per minute. [23:44:47] yeah, if we could use the lock feature I think that would be good [23:46:25] legoktm: lemme take a shot at it, looks simple enough in code. Are you free in 5 min to review? [23:46:52] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.020 second response time [23:48:00] at some point btw I 'd like to understand why this is biting us today and not yesterday. Like.. what changed ? [23:48:16] awight: unfortunately no, I'm in class right now >.< [23:48:21] akosiaris, Right. I totally agree [23:48:54] legoktm, damn [23:50:59] (03PS1) 10Awight: Cache anti-stampede improvements [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393945 (https://phabricator.wikimedia.org/T181567) [23:51:02] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 7.533 second response time [23:52:27] (03CR) 10jerkins-bot: [V: 04-1] Cache anti-stampede improvements [extensions/ORES] - 10https://gerrit.wikimedia.org/r/393945 (https://phabricator.wikimedia.org/T181567) (owner: 10Awight) [23:54:52] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 872 bytes in 0.020 second response time [23:55:22] ok, I need to relocate before testing in beta [23:55:31] Cool. I'll read about lockTSE [23:56:24] I assume that uses a semaphore across get -> fetch -> set [23:57:32] psych. [23:57:35] stopped beta ores. [23:57:49] deleted enwiki damaging threshold key [23:58:17] confirmed degraded but functional page, and cached empty array. [23:58:39] Yeah... I can't grok these docs. [23:59:34] shit, how do I tell it’s not trying to request from the disabled service... [23:59:58] disabled ORES?