[09:26:59] (CR) Ladsgroup: [V: 2 C: 2] Update to Celery 4 [services/ores/deploy] - https://gerrit.wikimedia.org/r/386458 (https://phabricator.wikimedia.org/T178441) (owner: Awight)
[09:27:14] (CR) Ladsgroup: [V: 2 C: 2] Correct wheels submodule location [services/ores/deploy] - https://gerrit.wikimedia.org/r/386457 (owner: Awight)
[11:12:44] Scoring-platform-team, ORES, Services (watching): ORES eternal server error for edit with many added links - https://phabricator.wikimedia.org/T179064#3712140 (Pchelolo)
[15:19:41] o/
[15:20:27] Amir1: fyi, I ran into some badness on beta, so decided not to deploy yesterday.
[15:21:47] Scoring-platform-team, ORES, editquality-modeling, Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), and 2 others: Enable ORES filters for svwiki - https://phabricator.wikimedia.org/T174560#3712538 (Lokal_Profil) @Halfak Hi Just checking if you have a new ETA for this?
[15:25:50] Scoring-platform-team, ORES, editquality-modeling, Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), and 2 others: Enable ORES filters for svwiki - https://phabricator.wikimedia.org/T174560#3712570 (awight) @Lokal_Profil Hi, thank you for your patience! We have the models built and ready...
[15:27:05] awight: I saw :(
[15:27:11] What we can do now?
[15:27:31] Cool. I’m going to debug locally to avoid my typical heavy churn on production :)
[15:27:58] One thing that bothers me is that the problems might be caused by the very manual process I used to deploy to beta.
[15:28:10] I might know soon, if I can reproduce locally.
[15:45:26] Making things harder for myself by trying to run the ores prod repo directly, rather than using mw-vagrant.
[16:25:41] Scoring-platform-team, Wikilabels, Google-Code-in-2017: pytest for flask application of wikilables - https://phabricator.wikimedia.org/T179015#3710308 (Florian) @Ladsgroup Can you elaborate on what the task for a possible GCI student would be?
What should they create/change exactly, what do you want...
[16:37:56] Also trying to test Celery 4 on the new cluster, but I just ran into a glitch where the venv directory wasn’t rebuilt.
[16:39:34] (PS1) Awight: Update frozen-requirements.txt to include revscoring 2 and celery 4 [services/ores/deploy] - https://gerrit.wikimedia.org/r/386646
[17:14:43] Amir1: Deployed to the canary, FYI.
[17:14:56] Rolling back.
[17:15:00] let's monitor traffic a little
[17:15:15] GRR
[17:15:23] rollback failed silently.
[17:17:09] Forcing rollback to ab88a74d087efff620a3eeb0e5aad1540d2a838b
[17:17:31] We should be back up.
[17:17:35] That was horrible
[17:18:04] Reading the logs to debug.
[17:21:23] > ImportError: No module named 'docopt'
[17:21:27] haahha
[17:25:56] Scoring-platform-team (Current), ORES, Release-Engineering-Team: Scap doesn't rebuilt virtualenv directory when deploying to ores* targets - https://phabricator.wikimedia.org/T179095#3713113 (awight)
[17:31:03] Scoring-platform-team (Current), DBA, Operations, cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3713192 (bd808) >>! In T168584#3570152, @bd808 wrote: > If we do lose a disk on 1001/3 to the powercycle though it will be hard to recover so we s...
[17:37:37] Scoring-platform-team (Current), ORES: Deployment to canary causes an import error on docopt - https://phabricator.wikimedia.org/T179098#3713210 (awight)
[17:39:11] (PS1) Awight: Add checks to cause virtualenv rebuild on the new ORES cluster. [services/ores/deploy] - https://gerrit.wikimedia.org/r/386663 (https://phabricator.wikimedia.org/T179095)
[17:39:31] (PS2) Awight: Add checks to cause virtualenv rebuild on the new ORES cluster. [services/ores/deploy] - https://gerrit.wikimedia.org/r/386663 (https://phabricator.wikimedia.org/T179095)
[17:40:30] (CR) Ladsgroup: [V: 2 C: 2] Add checks to cause virtualenv rebuild on the new ORES cluster.
[services/ores/deploy] - https://gerrit.wikimedia.org/r/386663 (https://phabricator.wikimedia.org/T179095) (owner: Awight)
[17:40:38] :D
[17:42:42] Scoring-platform-team (Current), ORES: Deployment to canary causes an import error on docopt - https://phabricator.wikimedia.org/T179098#3713235 (awight) Confirmed that the wheel is present in `submodules/wheels/docopt-0.6.2-py2.py3-none-any.whl`.
[17:42:53] I’m going to deploy to the new cluster to investigate the docopt nonsense.
[17:43:34] Amir1: Do you have suggestions for how to run the production deployment repo locally, realistically?
[17:44:16] awight: AFAIK there is no way to do that
[17:45:11] ok, thanks for the sanity check. I’ll deploy to one of the new cluster boxes as if it were my own :)
[17:48:33] venv/lib/python3.4/site-packages/ is mostly empty. O_o
[17:51:25] Scoring-platform-team (Current), ORES, Release-Engineering-Team, Patch-For-Review: Scap doesn't rebuilt virtualenv directory when deploying to ores* targets - https://phabricator.wikimedia.org/T179095#3713285 (awight) Now I'm seeing a mostly empty venv directory...
[17:57:29] Scoring-platform-team (Current), ORES: Deployment to canary causes an import error on docopt - https://phabricator.wikimedia.org/T179098#3713301 (awight) This seems to be caused by T179095, we are rebuilding the venv directory empty.
[18:01:55] Scoring-platform-team (Current), ORES, Release-Engineering-Team, Patch-For-Review: Scap doesn't rebuilt virtualenv directory when deploying to ores* targets - https://phabricator.wikimedia.org/T179095#3713306 (awight) The target machine's logs don't say exactly what happens, I see: ``` Executing...
[18:06:17] Scoring-platform-team (Current), ORES, Release-Engineering-Team, Patch-For-Review: Scap doesn't rebuilt virtualenv directory when deploying to ores* targets - https://phabricator.wikimedia.org/T179095#3713331 (awight) Running in a user virtualenv, I obtained the clue: ``` pip install --use-wheel...
[18:16:34] Scoring-platform-team (Current), DBA, Operations, cloud-services-team, Patch-For-Review: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3713361 (madhuvishy) Started a planning doc for the reboots here - https://etherpad.wikimedia.org/p/labsdb-reboots
[18:32:42] Scoring-platform-team (Current), ORES, Release-Engineering-Team, Patch-For-Review: Scap doesn't rebuilt virtualenv directory when deploying to ores* targets - https://phabricator.wikimedia.org/T179095#3713427 (awight) I've rebuilt the wheels again, on ores-misc-01. Python version is the same, ma...
[18:57:26] Scoring-platform-team (Current), ORES, Release-Engineering-Team, Patch-For-Review: Scap doesn't rebuilt virtualenv directory when deploying to ores* targets - https://phabricator.wikimedia.org/T179095#3713459 (awight) I tried `pip install pip==1.5.6`, then rebuilt wheels, but we're still getting...
[19:01:18] Scoring-platform-team, ORES: ORES service erroring, in a way that throws exceptions in Extension:ORES - https://phabricator.wikimedia.org/T179107#3713464 (awight)
[19:01:27] Scoring-platform-team, ORES: ORES service erroring, in a way that throws exceptions in Extension:ORES - https://phabricator.wikimedia.org/T179107#3713479 (awight) p:Triage>Unbreak!
[19:05:08] Scoring-platform-team, ORES: ORES service erroring, in a way that throws exceptions in Extension:ORES - https://phabricator.wikimedia.org/T179107#3713491 (awight) Seems that the root cause is: > requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='wikidata.org', port=443): Read timed out. (read tim...
[19:06:22] Amir1: Do you know anything about the wikidata API faltering? ^
[19:07:25] I can't see
[19:07:27] fuck
[19:07:33] awight: ^
[19:07:46] How the fuck I can reset this fuck
[19:07:51] Amir1: hehe, sorry. https://phabricator.wikimedia.org/T179107#3713491
[19:08:12] btw, fun unix tool for a less pressing moment: https://github.com/nvbn/thefuck
[19:08:44] awight: I think it's a config
[19:08:48] it should be a config
[19:08:56] we need to increase the timeout
[19:09:01] hmm k
[19:09:12] but that's weird because wikidata api should respond in less than 5 secs
[19:09:28] grr, I need to fork us at a stable version.
[19:09:40] Wonder if that’s even going to work with scap...
[19:10:15] I see that the config can be set in extractors.wikidata_api.timeout
[19:10:28] looking for a graph of wikidata api median, 95th percentile etc.
[19:11:17] * awight breaks into a sweat
[19:12:02] I’m not finding the performance dashboard.
[19:23:17] (PS1) Awight: Blindly choose a timeout of 15 seconds [services/ores/deploy] - https://gerrit.wikimedia.org/r/386688 (https://phabricator.wikimedia.org/T179107)
[19:25:29] (CR) Ladsgroup: [V: 2 C: 2] Blindly choose a timeout of 15 seconds [services/ores/deploy] - https://gerrit.wikimedia.org/r/386688 (https://phabricator.wikimedia.org/T179107) (owner: Awight)
[19:27:39] Amir1: I’m gonna tweak the code to make that exception a hair quieter, unless you feel strongly that it should be a fatal?
[19:28:35] IMO when service fatals, the extension should fatal too (to convey the problem) but to be honest, it's not a strong feeling
[19:28:45] if someone else agrees with you, I'm fine
[19:30:31] k i’m starting with the timeout tweak, since it’s just wikidata.
[19:31:24] (PS1) Awight: Blindly choose a timeout of 15 seconds [services/ores/deploy] (STABLE_REVSCORING_1) - https://gerrit.wikimedia.org/r/386691 (https://phabricator.wikimedia.org/T179107)
[19:31:37] (CR) Awight: [V: 2 C: 2] "cherry-picking to stable" [services/ores/deploy] (STABLE_REVSCORING_1) - https://gerrit.wikimedia.org/r/386691 (https://phabricator.wikimedia.org/T179107) (owner: Awight)
[20:02:19] Scoring-platform-team, ORES, Patch-For-Review: ORES service erroring, in a way that throws exceptions in Extension:ORES - https://phabricator.wikimedia.org/T179107#3713625 (awight) After a quick discussion on IRC, the consensus is that Extension:ORES should continue to fatal when it receives fatals f...
[20:04:57] Scoring-platform-team, ORES, Wikidata: Wikidata MediaWiki API timing out on ORES extractor requests - https://phabricator.wikimedia.org/T179112#3713627 (awight)
[20:08:52] Amir1: Nice work!
[20:09:17] I wish I could help me
[20:09:20] *more
[20:09:47] Amir1: One thing that struck me as weird is that I didn’t find any of the API requests in logstash
[20:09:52] Amir1: you fixed it!
[20:09:57] Scoring-platform-team, ORES, Patch-For-Review: ORES service erroring, in a way that throws exceptions in Extension:ORES - https://phabricator.wikimedia.org/T179107#3713645 (greg) {F10455593}
[20:10:18] We can pick it up again tomorrow, I’ll poke around a bit and will comment on the task...
[20:11:37] okay
[20:11:44] I need to leave soon
[20:11:54] will be back online really before the meeting
[20:20:26] Scoring-platform-team (Current), ORES, Wikidata: Wikidata MediaWiki API timing out on ORES extractor requests - https://phabricator.wikimedia.org/T179112#3713697 (awight)
[20:20:35] Scoring-platform-team (Current), ORES, Patch-For-Review: ORES service erroring, in a way that throws exceptions in Extension:ORES - https://phabricator.wikimedia.org/T179107#3713699 (awight)
[20:33:50] Scoring-platform-team (Current), ORES, Wikidata: Wikidata MediaWiki API timing out on ORES extractor requests - https://phabricator.wikimedia.org/T179112#3713725 (awight) It seems that the wikidata API is slow overall, so this might be resolved by external work. Meanwhile, I haven't been able to det...
[20:34:14] O/
[20:36:30] Scoring-platform-team (Current), ORES, Patch-For-Review: ORES service erroring, in a way that throws exceptions in Extension:ORES - https://phabricator.wikimedia.org/T179107#3713731 (awight) One thing we can do in the future is to throw a more specific exception, which passes through the ORES service...
[20:39:29] Scoring-platform-team (Current), ORES: ORES service should return a readable error when MW API read times out - https://phabricator.wikimedia.org/T179117#3713743 (awight)
[20:56:34] Scoring-platform-team, MediaWiki-extensions-ORES, Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2017): UX check RC Filters in beta (revscoring 2.0/thresholds release) - https://phabricator.wikimedia.org/T178395#3713814 (Halfak) I think that @Ladsgroup or @awight might be better at answering your...
[20:57:27] awight: halfak fyi ores may see lots of 5xxs due to 503s due to varnish
[20:57:34] And apipools and etc
[20:57:48] Zppix: thx yes we were knee-deep in it unfortunately.
[20:57:49] Yikes. Is there some sort of outage?
[20:58:00] * halfak just got back online
[20:58:29] halfak: It’s still an open question, but we were getting a lot of timeouts from api.php, especially on wikidata.org.
[20:58:44] Gotcha.
[20:58:53] Related to ORES or something bigger?
[20:58:58] That cascaded through the ORES service, Extension:ORES and was causing a ton of fatals.
[20:59:10] It’s also occurring on other wikis.
[20:59:15] Not just ORES, “luckily”.
[20:59:18] lol
[20:59:20] We can catch up in the meeting...
[20:59:31] Amir1 saved the day, as always :D
[20:59:32] * halfak looks at 5xx rate graph
[20:59:35] O_O
[20:59:42] \o/ Amir1!
[20:59:52] Good time to ping. MEEEEETING
[20:59:53] yeah the long timeouts are using up all of the available API connections
[20:59:57] hehe
[21:00:51] halfak: it's an operations fault
[21:01:45] I was getting complaints from enwiki users (im also a tech ambassador god do i do too much) and i thought I'd relay it here
[21:07:10] halfak: sorry for being late
[21:46:45] Scoring-platform-team, ORES, Services (watching): ORES eternal server error for edit with many added links - https://phabricator.wikimedia.org/T179064#3713862 (Halfak) @awight was experimenting with deployments on beta. This will likely get resolved when we get the next version of ORES first deploye...
[21:47:11] Scoring-platform-team (Current), ORES, Services (watching): ORES eternal server error for edit with many added links - https://phabricator.wikimedia.org/T179064#3713866 (Halfak)
[21:47:22] Scoring-platform-team, Wikilabels, Google-Code-in-2017: pytest for flask application of wikilabels - https://phabricator.wikimedia.org/T179015#3713867 (awight)
[21:47:53] Scoring-platform-team, Wikilabels, Google-Code-in-2017: pytest for flask application of wikilabels - https://phabricator.wikimedia.org/T179015#3710308 (Halfak) p:Normal>Low
[21:48:08] Scoring-platform-team, Wikilabels, Google-Code-in-2017: pytest for database of wikilabels - https://phabricator.wikimedia.org/T179014#3710279 (Halfak) p:Normal>Low
[21:49:38] Scoring-platform-team, Patch-For-Review: ORES deployment submodules should point to phabricator HTTPS repos. - https://phabricator.wikimedia.org/T179009#3710070 (Halfak) We used to point to the https in phabricator explodes when you try to pull from our giant repos.
[21:49:41] Scoring-platform-team, Patch-For-Review: ORES deployment submodules should point to phabricator HTTPS repos.
- https://phabricator.wikimedia.org/T179009#3713874 (awight)
[21:49:44] Scoring-platform-team, Gerrit, ORES, Operations, and 3 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3713875 (awight)
[21:50:27] Scoring-platform-team, Beta-Cluster-Infrastructure, ORES: ORESFetchScoreJob: RuntimeException No model available for [goodfaith] - https://phabricator.wikimedia.org/T178792#3713878 (Halfak) p:Triage>High
[21:50:56] Scoring-platform-team, Bad-Words-Detection-System, revscoring, artificial-intelligence: Add language support for Icelandic - https://phabricator.wikimedia.org/T178524#3694816 (Halfak) p:Triage>Normal
[21:51:23] Scoring-platform-team, Bad-Words-Detection-System, revscoring, artificial-intelligence: Add language support for Icelandic - https://phabricator.wikimedia.org/T178524#3694816 (Halfak) @Snaevar ^
[21:52:48] Scoring-platform-team: FetchScoreJob is trying to update scores for nonexistent models - https://phabricator.wikimedia.org/T177967#3713890 (Halfak) @Ladsgroup said he knows about a related bug. Maybe share it here. :)
[21:53:14] Scoring-platform-team, Beta-Cluster-Infrastructure: FetchScoreJob is trying to update scores for nonexistent models - https://phabricator.wikimedia.org/T177967#3676372 (Halfak) p:Triage>Normal
[21:53:19] Scoring-platform-team, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: FetchScoreJob is trying to update scores for nonexistent models - https://phabricator.wikimedia.org/T177967#3713895 (awight) p:Normal>Triage
[21:54:05] Scoring-platform-team, Wikilabels: [Discuss] Wikilabels routes refactor - https://phabricator.wikimedia.org/T165046#3255293 (Halfak) p:Triage>Low
[21:57:53] biab
[22:53:48] Scoring-platform-team, Patch-For-Review: ORES deployment submodules should point to phabricator HTTPS repos. - https://phabricator.wikimedia.org/T179009#3713990 (demon) We need to fix that.
SSH doesn't scale indefinitely.