[00:04:57] (PS2) Catrope: Follow-up c047cd54d69ed: rename oresDamagingPref values back [extensions/ORES] - https://gerrit.wikimedia.org/r/350316 (https://phabricator.wikimedia.org/T160575)
[00:06:37] (CR) jenkins-bot: [V: -1] Follow-up c047cd54d69ed: rename oresDamagingPref values back [extensions/ORES] - https://gerrit.wikimedia.org/r/350316 (https://phabricator.wikimedia.org/T160575) (owner: Catrope)
[00:08:31] (PS1) Catrope: Revert "Localisation updates from https://translatewiki.net." [extensions/ORES] - https://gerrit.wikimedia.org/r/350335
[00:17:09] (CR) Jforrester: [C: 2] Revert "Localisation updates from https://translatewiki.net." [extensions/ORES] - https://gerrit.wikimedia.org/r/350335 (owner: Catrope)
[00:17:39] (PS3) Catrope: Follow-up c047cd54d69ed: rename oresDamagingPref values back [extensions/ORES] - https://gerrit.wikimedia.org/r/350316 (https://phabricator.wikimedia.org/T160575)
[00:18:28] (PS2) Jforrester: Follow-up 2b68933208: use variables [extensions/ORES] - https://gerrit.wikimedia.org/r/350008 (owner: Catrope)
[00:18:52] (Merged) jenkins-bot: Revert "Localisation updates from https://translatewiki.net." [extensions/ORES] - https://gerrit.wikimedia.org/r/350335 (owner: Catrope)
[00:21:15] (CR) jenkins-bot: [V: -1] Follow-up 2b68933208: use variables [extensions/ORES] - https://gerrit.wikimedia.org/r/350008 (owner: Catrope)
[00:36:33] (PS3) Catrope: Follow-up 2b68933208: use variables [extensions/ORES] - https://gerrit.wikimedia.org/r/350008
[00:38:58] (CR) Jforrester: [C: 2] Follow-up 2b68933208: use variables [extensions/ORES] - https://gerrit.wikimedia.org/r/350008 (owner: Catrope)
[00:40:18] halfak: hey, I just woke up
[00:40:30] do you want me to review it now?
[00:40:31] hey! New ORES is in wmflabs and beta
[00:40:36] nice
[00:40:36] Na.
I self-merged :)
[00:40:42] (Merged) jenkins-bot: Follow-up 2b68933208: use variables [extensions/ORES] - https://gerrit.wikimedia.org/r/350008 (owner: Catrope)
[00:41:05] Could you check to see if everything is happy with beta stuff?
[00:41:13] (CR) Mattflaschen: [C: 2] Follow-up c047cd54d69ed: rename oresDamagingPref values back [extensions/ORES] - https://gerrit.wikimedia.org/r/350316 (https://phabricator.wikimedia.org/T160575) (owner: Catrope)
[00:42:42] sure
[00:42:42] I'm just reading lots and lots of emails now
[00:43:01] (Merged) jenkins-bot: Follow-up c047cd54d69ed: rename oresDamagingPref values back [extensions/ORES] - https://gerrit.wikimedia.org/r/350316 (https://phabricator.wikimedia.org/T160575) (owner: Catrope)
[01:11:29] No hurry. I just want to make sure we
[01:11:33] 're OK to deploy tomorrow
[01:11:35] Amir1, ^
[01:11:52] I'm actually not in that much of a rush to deploy :)
[01:13:10] halfak: sure
[02:21:33] Revision-Scoring-As-A-Service, ORES: Deploy ORES mid-April - https://phabricator.wikimedia.org/T162892#3212810 (Halfak) Thanks!
[07:26:14] (CR) Thiemo Mättig (WMDE): [C: 1] "All issues fixed, thanks. I still don't know enough about the bigger picture of this change, and do not plan to dig deeper into the code." (4 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/344815 (https://phabricator.wikimedia.org/T161029) (owner: Gergő Tisza)
[10:09:11] (PS5) Gergő Tisza: Forward request data to ORES API [extensions/ORES] - https://gerrit.wikimedia.org/r/344815 (https://phabricator.wikimedia.org/T161029)
[12:01:33] Revision-Scoring-As-A-Service-Backlog, Beta-Cluster-Infrastructure, ORES: ORES errors on zhwp beta - https://phabricator.wikimedia.org/T163873#3213496 (Aklapper)
[13:59:55] I will be five minutes late to the retro
[14:04:33] I'm there
[14:23:34] Revision-Scoring-As-A-Service-Backlog, ORES, Patch-For-Review: Implement twemproxy for ORES in production - https://phabricator.wikimedia.org/T122676#3213980 (akosiaris) So, a case of needing a maintenance on the (primary/master) redis host has arrived. We will need to perform some cabling changes to...
[16:05:36] Revision-Scoring-As-A-Service, Edit-Review-Improvements-RC-Page, ORES, Collaboration-Team-Triage (Collab-Team-Q3-Jan-Mar-2017), User-notice: Enable the ORES good faith and damaging UI by default, on wikis that have these ORES models available (i... - https://phabricator.wikimedia.org/T158225#3214512
[17:16:57] Revision-Scoring-As-A-Service, MediaWiki-JobQueue, MediaWiki-Watchlist, MediaWiki-extensions-ORES, and 5 others: Watchlist entries duplicated several times - https://phabricator.wikimedia.org/T163337#3214837 (Krinkle) >>! In T163337#3203234, @Krinkle wrote: >>>! In T163337#3200616, @Joe wrote: >>...
[17:41:20] Revision-Scoring-As-A-Service, MediaWiki-JobQueue, MediaWiki-Watchlist, MediaWiki-extensions-ORES, and 5 others: Watchlist entries duplicated several times - https://phabricator.wikimedia.org/T163337#3193929 (elukey) Checked very quickly replication status and db keys stored for the rdbs 2003 / 1...
[17:52:04] Revision-Scoring-As-A-Service-Backlog, ORES, revscoring: Explore OSM integration for ORES - https://phabricator.wikimedia.org/T163928#3214997 (Halfak)
[20:03:37] Revision-Scoring-As-A-Service-Backlog, Beta-Cluster-Infrastructure, ORES: On beta cluster, ORESFetchScoreJob got a HTTP 400 bad request from ores-beta - https://phabricator.wikimedia.org/T157790#3215429 (Mattflaschen-WMF)
[20:04:36] Revision-Scoring-As-A-Service-Backlog, Beta-Cluster-Infrastructure, ORES: On beta cluster, ORESFetchScoreJob got a HTTP 400 bad request from ores-beta - https://phabricator.wikimedia.org/T157790#3016510 (Mattflaschen-WMF)
[20:04:46] Revision-Scoring-As-A-Service-Backlog, ORES: Prod: Bad Request (400) on testwiki test models - https://phabricator.wikimedia.org/T163764#3208633 (Mattflaschen-WMF)
[20:05:03] Revision-Scoring-As-A-Service-Backlog, ORES: Prod: Bad Request (400) on testwiki test models - https://phabricator.wikimedia.org/T163764#3208633 (Mattflaschen-WMF)
[20:37:10] Revision-Scoring-As-A-Service, ORES: Deploy ORES mid-April - https://phabricator.wikimedia.org/T162892#3178444 (Halfak) a:Halfak
[20:37:52] Revision-Scoring-As-A-Service, rsaas-editquality: Train/test damaging & goodfaith model for Finnish Wikipedia - https://phabricator.wikimedia.org/T163012#3183245 (Halfak) This is now deployed. We're ready for an ORES Review Tool or an ERI deployment
[20:41:15] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:41:28] WOops. Something is up.
[20:42:15] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 955 bytes in 6.559 second response time
[20:42:25] Weird.
[20:45:15] hiccup?
[20:51:30] 6.5 second response time though?!
[21:00:57] Revision-Scoring-As-A-Service, ORES: Timeouts on CODFW - https://phabricator.wikimedia.org/T163944#3215710 (Halfak)
[21:01:07] Revision-Scoring-As-A-Service, ORES: Timeouts on CODFW - https://phabricator.wikimedia.org/T163944#3215723 (Halfak) p:Triage>Unbreak!
[21:01:42] Revision-Scoring-As-A-Service, ORES: Timeouts on CODFW - https://phabricator.wikimedia.org/T163944#3215726 (Halfak) Here's some scorings directly to EQIAD (Note the 1.3-1.5s speed) ``` halfak@scb1002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242326 { "scores": { "fiwiki": { "goodf...
[21:02:19] Revision-Scoring-As-A-Service, ORES: Timeouts on CODFW - https://phabricator.wikimedia.org/T163944#3215727 (Halfak) Same scorings to a server in CODFW (2 timeout and 1 runs in 12s) ``` halfak@scb2002:~$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242326 { "scores": { "fiwiki": { "goodf...
[21:05:05] Revision-Scoring-As-A-Service, ORES: Timeouts on CODFW - https://phabricator.wikimedia.org/T163944#3215732 (Halfak) I get this in the applog on scb2002: ``` Traceback (most recent call last): File "/srv/deployment/ores/venv/lib/python3.4/site-packages/celery/app/trace.py", line 240, in trace_task R...
[21:10:41] Trying a restart.
[21:10:54] It looks like there's somehow some old code running.
Like celery didn't get restarted in codfw
[21:11:21] Revision-Scoring-As-A-Service, ORES: Timeouts on CODFW - https://phabricator.wikimedia.org/T163944#3215750 (Halfak) I just ran `scap deploy --service-restart`
[21:15:25] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:15:36] right
[21:19:15] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 940 bytes in 0.521 second response time
[21:24:26] WTF
[21:24:38] halfak: I get intermittent 404s from https://ores.wikimedia.org/scores/fiwiki/damaging/?model_info=test_stats saying no models available for fiwiki
[21:24:43] But if I refresh it fixes itself
[21:24:56] Right. There's an issue we're having
[21:25:12] Old code doesn't know that fiwiki exists and will also cause errors when scoring.
[21:25:16] Ditto for https://ores.wikimedia.org/scores/etwiki/goodfaith/?model_info=test_stats I get intermittent 400s saying there is no goodfaith model for etwiki, and if I refresh I get data
[21:25:22] It's only codfw that has the problem.
[21:25:22] Hmm
[21:25:34] I'm working on it as fast as I can.
[21:25:48] I don't believe a rollback will help because the deployment system isn't working right for codfw.
[21:31:09] No worries from my side, I got the data that I needed
[21:31:52] Shouldn't break RCFilters in prod either because it's still using hardcoded thresholds
[21:32:01] But we're planning to switch back to using test_stats tomorrow
[21:32:13] Even then though, it's supposed to be caching those test_stats for a while
[21:32:36] 24h
[21:32:46] So intermittent failures should not cause much trouble
[21:33:59] RoanKattouw, we should just be able to fail over to eqiad.
I don't know why no one is getting the page for this :\
[21:34:12] Right
[21:35:41] Revision-Scoring-As-A-Service-Backlog, MediaWiki-extensions-ORES, Collaboration-Team-Triage (Collab-Team-Q4-Apr-Jun-2017): Deploy ORES Review Tool to Finnish Wikipedia - https://phabricator.wikimedia.org/T163011#3215809 (Catrope) a:Catrope
[21:36:25] Revision-Scoring-As-A-Service-Backlog, MediaWiki-extensions-ORES, Collaboration-Team-Triage (Collab-Team-Q4-Apr-Jun-2017): Deploy ORES Review Tool to Finnish Wikipedia - https://phabricator.wikimedia.org/T163011#3183232 (Catrope)
[21:36:28] Revision-Scoring-As-A-Service, rsaas-editquality, Epic: [Epic] Edit quality models (damaging/goodfaith) - https://phabricator.wikimedia.org/T130213#3215814 (Catrope)
[21:36:31] Revision-Scoring-As-A-Service, rsaas-editquality: Train/test damaging & goodfaith model for Finnish Wikipedia - https://phabricator.wikimedia.org/T163012#3183245 (Catrope)
[21:36:33] Revision-Scoring-As-A-Service, Wikilabels, rsaas-editquality: Complete editquality campaign for Finnish Wikipedia - https://phabricator.wikimedia.org/T163013#3215815 (Catrope) Open>Resolved a:Catrope
[21:36:40] Revision-Scoring-As-A-Service, Wikilabels, rsaas-editquality: Complete editquality campaign for Finnish Wikipedia - https://phabricator.wikimedia.org/T163013#3183261 (Catrope) a:Catrope>None
[21:42:12] Revision-Scoring-As-A-Service, ORES: Timeouts on CODFW - https://phabricator.wikimedia.org/T163944#3215841 (Halfak) That didn't help anything. Still seeing this error in the logs. I've confirmed that new code was in fact deployed to scb2002 and for some reason uwsgi is just not using it.
[21:42:37] Revision-Scoring-As-A-Service, ORES: Timeouts on CODFW - https://phabricator.wikimedia.org/T163944#3215842 (Halfak) FYI: https://gerrit.wikimedia.org/r/350487 This will re-route traffic to eqiad.
[21:50:38] Revision-Scoring-As-A-Service, ORES: Timeouts on CODFW - https://phabricator.wikimedia.org/T163944#3215901 (Halfak) Here's some IRC: ``` [16:43:36] halfak: done.. how is it now? [16:43:50] * halfak watches the errors suddenly stop on codfw [16:43:54] checking on grafana [16:45:31] Revision-Scoring-As-A-Service, ORES: Investigate failed deploy to CODFW - https://phabricator.wikimedia.org/T163950#3215912 (Halfak)
[21:54:43] Revision-Scoring-As-A-Service, ORES: Investigate failed deploy to CODFW - https://phabricator.wikimedia.org/T163950#3215937 (Halfak)
[21:55:59] RoanKattouw, looks like we're in the clear.
[21:57:58] halfak: Awesome, WFM now
[21:58:07] \o/
[21:58:37] Well, that was stressful. I feel like I should have a cigarette or something. What do non-smokers do in a situation like this?
[22:02:57] have a beer
[22:12:20] Revision-Scoring-As-A-Service, ORES: Investigate failed deploy to CODFW - https://phabricator.wikimedia.org/T163950#3216017 (Halfak) https://wikitech.wikimedia.org/wiki/Incident_documentation/20170426-ORES
[22:12:32] FYI: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170426-ORES
[22:12:37] * halfak goes to get a beer.
[22:17:12] Nettrom, https://imgur.com/gallery/WaO9Y
[22:17:43] BTW, I've been working on that wiki-research-l thread in shifts.
[22:17:53] I think I'm going to need a summary from you :@
[22:20:09] Revision-Scoring-As-A-Service-Backlog, MediaWiki-extensions-ORES, Collaboration-Team-Triage (Collab-Team-Q4-Apr-Jun-2017), MW-1.29-release (WMF-deploy-2017-04-11_(1.29.0-wmf.20)), Patch-For-Review: " Highlight likely problem edits with colors an... - https://phabricator.wikimedia.org/T163712#3216038
[22:33:36] halfak: that looks nice, and well deserved, hope you enjoy it! :)
[22:33:51] yeah, the wiki-research-l thread took off yesterday… I’ve read through it today and prepped some notes on a response
[22:33:56] :) Thanks.
Reading through my CSCW submission now :)
[22:34:16] Nice. I'd love to talk to you for 15 minutes or so about what you got from it.
[22:34:48] I can add it to the agenda for tomorrow’s meeting?
[22:34:55] or do you prefer a 1-on-1?
[22:37:13] Na. Tomorrow's meeting sounds great :D
[22:37:27] Oooh and we might have Andrew Hall in that meeting
[22:37:31] cool, I’m already on there, so I can just add this…
[22:37:36] As he's officially an intern now.
[22:39:50] oh cool!