[10:01:53] o/ [10:02:04] let me see what's wrong with my monitors :/ [10:09:09] PROBLEM - ping4 on ORES-worker01.experimental is WARNING: PING WARNING - DUPLICATES FOUND! Packet loss = 0%, RTA = 1.10 ms [10:38:46] 10ORES, 10Scoring-platform-team, 10Growth-Team: RC filters: Un-scored Bot edits in Namespaces displayed with ORES full-coverage filters - https://phabricator.wikimedia.org/T206271 (10SBisson) Since this only occurs for the edge case in which a user selects all model scores, we will not prioritize. There is... [15:15:37] (03PS1) 10Sbisson: Fix articlequality thresholds [extensions/ORES] - 10https://gerrit.wikimedia.org/r/468999 (https://phabricator.wikimedia.org/T207614) [15:28:55] 10ORES, 10Scoring-platform-team (Current), 10User-Ladsgroup: Upgrade celery to 4.1.0 for ORES - https://phabricator.wikimedia.org/T178441 (10Ladsgroup) https://github.com/wikimedia/ores/pull/276 [15:35:29] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 1945 bytes in 0.032 second response time [15:36:07] That's me ^ [15:37:09] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 1945 bytes in 0.026 second response time [15:37:20] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 1945 bytes in 0.045 second response time [16:08:30] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.557 second response time [16:09:09] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 0.692 second response time [16:09:19] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 0.646 second response time [16:33:23] hoo: You may have noticed another heap of CR piled on your inbox... 
They're less urgent, mostly test coverage, but should also be simpler to review. [16:33:30] awight: one thing that would help would be if you went through https://phabricator.wikimedia.org/project/view/2872/ and closed the tasks that are complete [16:33:40] harej: +1 thanks [16:34:23] 10JADE, 10Scoring-platform-team: Integrate JADE with MediaWiki "patrol" action - https://phabricator.wikimedia.org/T198085 (10awight) 05Open>03declined [16:34:30] 10JADE, 10Scoring-platform-team: Integrate JADE with PageTriage "Mark as reviewed" action - https://phabricator.wikimedia.org/T198086 (10awight) 05Open>03declined [16:34:54] 10JADE, 10Scoring-platform-team: Integrate JADE with FlaggedRevs manual review actions - https://phabricator.wikimedia.org/T198090 (10awight) 05Open>03declined [16:35:10] awight: Will (probably) have a look in a bit :) [16:35:15] harej: n.b. ^ I'm declining the "write-only" integrations I had planned, because our new strategy is to have a "full" integration or none at all... [16:35:24] +1 [16:36:31] 10JADE, 10Scoring-platform-team: Where should we host permalinks to JADE JSON schemas? - https://phabricator.wikimedia.org/T188284 (10awight) 05Open>03Resolved a:03awight "hacky way" is probably fine, I'll use that in the json-schema "id" field. 
[16:36:56] 10JADE, 10Scoring-platform-team: Investigate MCR support gap for JADE purposes - https://phabricator.wikimedia.org/T204303 (10awight) a:05awight>03None [16:38:58] 10JADE, 10Scoring-platform-team: Advanced handling for Jade edit conflicts - https://phabricator.wikimedia.org/T198691 (10awight) p:05Low>03Lowest [16:39:32] 10JADE, 10Scoring-platform-team, 10Design: Create overlay UI for editing Judgement pages - https://phabricator.wikimedia.org/T199128 (10awight) [16:39:44] 10JADE, 10Scoring-platform-team, 10Design: Design "rationales" integration for JADE feedback - https://phabricator.wikimedia.org/T185247 (10awight) [16:40:00] 10JADE, 10Scoring-platform-team, 10Design: Integrate Judgment watchlists with the target wiki entity - https://phabricator.wikimedia.org/T201361 (10awight) [16:40:13] 10JADE, 10Scoring-platform-team, 10MediaWiki-extensions-Scribunto: Expose JADE data through Lua - https://phabricator.wikimedia.org/T203853 (10awight) [16:40:43] 10JADE, 10Scoring-platform-team: Write a letter to JADE stakeholders - https://phabricator.wikimedia.org/T197668 (10awight) a:05awight>03None [16:41:07] 10JADE, 10Scoring-platform-team: Write a letter to JADE stakeholders - https://phabricator.wikimedia.org/T197668 (10awight) a:03Harej [16:41:24] 10JADE, 10Scoring-platform-team (Current): Write a letter to JADE stakeholders - https://phabricator.wikimedia.org/T197668 (10awight) [16:42:58] 10JADE, 10Scoring-platform-team (Current): Review CSCW workshop paper for JADE - https://phabricator.wikimedia.org/T205892 (10awight) 05Open>03Resolved [16:43:00] 10Scoring-platform-team (Current), 10Documentation: Workshop proposal for CSCW (JADE, ORES, etc.) 
- https://phabricator.wikimedia.org/T204134 (10awight) [16:43:20] 10JADE, 10Scoring-platform-team (Current), 10DBA, 10Operations, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (10awight) [16:43:25] 10JADE, 10Scoring-platform-team (Current), 10DBA, 10Operations, and 4 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) 05Open>03Resolved [16:46:43] 10JADE, 10Scoring-platform-team (Current), 10DBA, 10Operations, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) @daniel @Krinkle @Catrope @Marostegui We're ready for another round of TechCom and DBA review, at... [16:48:01] awight: are you going to the scoring platform + research meeting in 13 minutes? [16:48:09] yep [16:48:17] Not sure what we have to bring to that table, though [16:48:52] I'm just planning to be there to answer any questions Research might have, I guess. [16:49:21] awight hi, would you be able to check the security group for project ores in Horizon (for port 5666) please? [16:49:39] paladox: sure, thanks--is this about the Icinga alerts? [16:49:46] yup [16:49:58] i've acked them so they shouldn't notify now. [16:54:54] paladox: ok lessee, we have security group Nagios, allowing TCP ingress on port 5666 from all IPs [16:55:01] It's definitely TCP and not UDP? [16:55:23] yup tcp [16:55:31] is that applied to the instances? [16:55:43] since only the default group is applied [16:55:52] ah ok checking [16:56:13] 10ORES, 10Scoring-platform-team, 10Release-Engineering-Team (Kanban), 10User-MModell: Create gerrit mirrors for all github-based ORES repos - https://phabricator.wikimedia.org/T192042 (10mmodell) Phabricator doesn't have proper git-lfs support and I've been told not to put any resources into phabricator's...
[16:56:24] I also see a 5666 rule in the default security group, which looks wiser. Same thing but only allows from 10.0.0.0/24 [16:56:31] err 10.0.0.0/8 [16:57:05] paladox: Can you give me an example failing instance? [16:57:18] ores-redis-02.ores.eqiad.wmflabs [16:57:42] and sorry--also an example succeeding instance? [16:58:05] you're right that the instances only have "default" and not that Nagios group [16:58:11] so they should allow ingress from 10/8 [16:59:18] paladox: Just for fun, I'm adding the nagios 0.0.0.0/0 rule to ores-redis-02, so we can see if it makes a difference. [16:59:27] {done} [16:59:29] ok thanks! [16:59:31] gtg for now [16:59:41] Thanks for tracking this :) [16:59:57] awight it works! [17:00:02] RECOVERY - check disk on ORES-redis02.experimental is OK: DISK OK [17:00:39] RECOVERY - check load on ORES-redis02.experimental is OK: OK - load average: 0.00, 0.01, 0.00 [17:01:02] paladox: ah ha! [17:02:02] awight could you apply it to ores-worker-02.ores.eqiad.wmflabs [17:02:15] and ores-worker-02.ores.eqiad.wmflabs [17:02:17] please? [17:02:23] * paladox and ores-worker-01.ores.eqiad.wmflabs [17:02:31] and ores-web-02.ores.eqiad.wmflabs [17:03:15] will do momentarily [17:03:23] but I don't think this is ideal [17:03:31] Why can't we restrict to internal IPs [17:03:33] ? [17:04:25] o/ [17:04:27] * halfak is at the tech conference [17:04:31] So I'll mark myself as AFK, but I'm sort of here. [17:09:57] 10JADE, 10Scoring-platform-team (Current), 10DBA, 10Operations, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) moving to TechCom inbox for review [17:12:40] awight: halAFK ores.wmflabs.org is being run on celery 4.1 (the branch celery4 in ores, it has an open PR) and it's fine: https://grafana-labs.wikimedia.org/dashboard/db/ores-labs?orgId=1 [17:12:46] I just hammered it :P [17:13:38] Nice! Can you bring it to overload and see what happens?
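[Editor's note] The overload test proposed here can be sketched with a thread pool. This is a toy illustration, not the actual `ores stress_test` utility mentioned below; `score_request` is a hypothetical stand-in for an HTTP call to ores.wmflabs.org.

```python
# Fire N concurrent requests at once and check they all complete,
# to see whether the service degrades gracefully and recovers.
from concurrent.futures import ThreadPoolExecutor

def score_request(rev_id):
    # Placeholder for a real HTTP call to the scoring endpoint,
    # e.g. GET /v3/scores/enwiki/{rev_id}/damaging with urllib.
    return {"rev_id": rev_id, "status": "ok"}

def hammer(n_requests=100):
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        return list(pool.map(score_request, range(n_requests)))

results = hammer()
print(len(results))  # 100
```

In a real run the interesting signal isn't the results themselves but what the Grafana dashboards show during and after the burst.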
[17:13:48] I want to see if it can recover or not. [17:13:53] because we had that problem before. [17:14:18] Amir1, ^ [17:14:30] sure [17:15:40] 100 threads at the same time, it should go down :D [17:22:48] lol. Right. [17:22:54] * halAFK braces for impact [17:23:11] I think we might even have a utility for hammering. [17:23:28] I think "ores score --include-features" [17:23:53] "ores stress_test"! [17:26:27] Amir1, ^ FYI [17:26:32] oh thanks [17:28:23] (03PS2) 10Catrope: Fix articlequality thresholds [extensions/ORES] - 10https://gerrit.wikimedia.org/r/468999 (https://phabricator.wikimedia.org/T207614) (owner: 10Sbisson) [17:28:23] I have to go, will be back soon [17:28:32] (03CR) 10Catrope: [C: 032] Fix articlequality thresholds [extensions/ORES] - 10https://gerrit.wikimedia.org/r/468999 (https://phabricator.wikimedia.org/T207614) (owner: 10Sbisson) [17:36:40] paladox: Do you know why we can't restrict that rule to internal IPs? Maybe there's another range to add during the Neutron migration? [17:37:04] awight you can if you know the range (i think) [17:37:44] paladox: When you get a minute, could you let me know the internal IP of the icinga2 box? [17:37:51] ok [17:37:53] * paladox looks [17:38:04] Thanks! [17:38:05] awight 172.16.1.180 [17:38:09] hmm [17:38:16] ah [17:38:43] I should be able to see the IP in the ORES box's netstat, in case it's getting NAT'd.
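[Editor's note] The root cause being uncovered here is a plain CIDR mismatch: the default group's 5666 rule only admits 10.0.0.0/8, but the icinga2 box sits at 172.16.1.180, outside that range. A quick illustrative check (the addresses are from the conversation above; the actual fix was made in the Horizon security groups, not in code):

```python
import ipaddress

# The icinga2 box's internal IP, as reported above.
icinga = ipaddress.ip_address("172.16.1.180")

# The NRPE (port 5666) source range in the "default" security group.
old_rule = ipaddress.ip_network("10.0.0.0/8")

# A range that actually covers the monitoring host.
wider_rule = ipaddress.ip_network("172.16.0.0/16")

print(icinga in old_rule)    # False -- why connections to 5666 were refused
print(icinga in wider_rule)  # True
```

This is why the instances only carrying "default" never answered the monitoring checks, while adding a rule covering the 172.16.0.0/16 net (or, crudely, 0.0.0.0/0) makes them recover.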
[17:39:14] ok [17:42:31] paladox: Looks like there's an established route across the nets, so I'm adding a rule to allow nagios from 172.16.0.0/16 [17:42:38] RECOVERY - check users on ORES-worker02.experimental is OK: USERS OK - 0 users currently logged in [17:42:38] RECOVERY - puppet on ORES-web02.Experimental is OK: OK: Puppet is currently enabled, last run 26 minutes ago with 0 failures [17:42:40] RECOVERY - check disk on ORES-web02.Experimental is OK: DISK OK [17:42:43] ok [17:43:03] :D [17:43:05] Seems like it works [17:43:06] RECOVERY - check load on ORES-web01.Experimental is OK: OK - load average: 0.00, 0.01, 0.03 [17:43:18] I've removed the nagios group from ores-redis-02 FYI [17:43:28] RECOVERY - check load on ORES-web02.Experimental is OK: OK - load average: 0.00, 0.01, 0.00 [17:43:58] RECOVERY - check load on ORES-worker01.experimental is OK: OK - load average: 0.00, 0.03, 0.44 [17:44:04] RECOVERY - check load on ORES-worker02.experimental is OK: OK - load average: 0.17, 0.10, 0.49 [17:44:13] RECOVERY - check disk on ORES-worker01.experimental is OK: DISK OK [17:44:13] RECOVERY - puppet on ORES-worker01.experimental is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [17:44:18] RECOVERY - puppet on ORES-web01.Experimental is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures [17:44:32] RECOVERY - check disk on ORES-web01.Experimental is OK: DISK OK [17:44:35] RECOVERY - check users on ORES-web02.Experimental is OK: USERS OK - 0 users currently logged in [17:44:48] RECOVERY - check users on ORES-web01.Experimental is OK: USERS OK - 0 users currently logged in [17:44:50] RECOVERY - check users on ORES-worker01.experimental is OK: USERS OK - 0 users currently logged in [17:44:51] RECOVERY - check disk on ORES-worker02.experimental is OK: DISK OK [17:44:57] RECOVERY - puppet on ORES-worker02.experimental is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:48:37] awight 
should add port 22 to the "nagios" group and limit to that ip too [17:48:42] http://gerrit-icinga.wmflabs.org/dashboard#!/monitoring/service/show?host=ORES-redis02.experimental&service=ssh [17:49:58] RECOVERY - ssh on ORES-worker02.experimental is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [17:50:02] paladox: Done, thanks [17:50:18] Glad we finally have this debugged! [17:50:18] RECOVERY - ssh on ORES-web01.Experimental is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [17:50:43] RECOVERY - ssh on ORES-web02.Experimental is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [17:51:07] RECOVERY - ssh on ORES-worker01.experimental is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [17:51:36] RECOVERY - ssh on ORES-redis02.experimental is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [17:55:59] 10JADE, 10Scoring-platform-team (Current), 10Release-Engineering-Team, 10Continuous-Integration-Config: Keep JADE compatible with MediaWiki LTS - https://phabricator.wikimedia.org/T207678 (10awight) [18:05:09] (03CR) 10jenkins-bot: Fix articlequality thresholds [extensions/ORES] - 10https://gerrit.wikimedia.org/r/468999 (https://phabricator.wikimedia.org/T207614) (owner: 10Sbisson) [18:05:11] awight thanks and you're welcome! [18:06:01] * awight breathes a sigh of relief at all the recovery alerts [19:09:33] harej: Here's an idea for T206037, we introduce a concept of "model family". [19:09:34] T206037: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 [19:09:42] Oooh. [19:10:04] Individual models can have arbitrary names per-wiki, and we can include experimental models e.g. "editquality-wordnet" [19:10:24] but when you get model info, the models can be grouped by the client's use-case [19:11:02] i.e., A client can detect multiple editquality models and surface in the UI as appropriate. [19:12:32] So in ORES land, a "model" is a use case, basically.
[19:12:46] Or rather, a model family is a use case. [19:13:19] Hmm. I might not say it that way. Use-case is too context specific. [19:13:34] But I think the rest of the idea is pretty solid. [19:15:20] "model family" is very close to "judgment schema" for JADE [19:16:03] This makes me feel weird about "contentquality"/"articlequality" all of a sudden :) [19:16:22] Would we put a "textcomplexity" model in "articlequality"/"contentquality"? [19:16:30] BTW, the specific articlequality scale is already loaded dynamically from configuration. It seems like a good precedent for model "species", that it can still have subspecies with their own flavor. [19:16:52] That works for me, conceptually. [19:17:07] halAFK: Good point, I hadn't thought about a wider variety of models in a family. [19:17:28] Would "talkquality" belong? [19:17:44] My original idea was that everything in a family is very similar [19:17:49] But this is an interesting twist. [19:17:57] I think. But we'd have "aggressiveness", "thankful", etc. models in there. [19:18:41] How would a client filter for compatible models? [19:18:59] "schema" seems like a really useful granularity [19:19:16] Hmm. Not sure they would. We might want to include "model family" in ORES somehow. [19:19:19] although articlequality having a variable scale must always be kept in mind [19:19:19] I think that is interesting :) [19:21:46] I think "talkquality" is the wrong thing to call it. [19:22:06] We're really only looking at "quality" in a really vague and abstract sense, unlike contentquality, where we are measuring against quality standards [19:22:38] I may be splitting hairs but I think discussion has different qualia from content. [19:24:02] +1 My gut agrees, but I thought it would be a good hare to split to help us define what contentquality is [19:24:49] a good hare, eh?
[19:25:05] * awight stitches it back together again [19:26:19] so, if models can have arbitrary labels and "model family" is used for client autodetection, then perhaps "articlequality" and "itemquality" are the correct public-facing labels for those models and we aren't renaming? [19:28:02] Tangential thought: model configuration could provide a boolean to hide a model from the default "all models" response. This gives us a really nice soft migration path from old model names. [19:28:39] Clients are expected to specify the list of models explicitly for forward-compatibility, and only humans or autodiscovering clients should request "all". [19:35:43] (03CR) 10Hoo man: [C: 032] Clean up test annotations; coversNothing for integration test [extensions/JADE] - 10https://gerrit.wikimedia.org/r/468192 (owner: 10Awight) [19:39:18] back now and ready for deploy [19:41:03] redis task tracking? Awesome, lmk if I should watch the kettle boil or anything [19:41:20] * awight opens a dashboard [19:41:26] sure, everything is fine so far [19:43:27] (03Merged) 10jenkins-bot: Clean up test annotations; coversNothing for integration test [extensions/JADE] - 10https://gerrit.wikimedia.org/r/468192 (owner: 10Awight) [19:46:09] (03CR) 10Hoo man: [C: 032] More MoveHooks tests [extensions/JADE] - 10https://gerrit.wikimedia.org/r/468193 (owner: 10Awight) [19:48:58] Our graphs are fun. I don't like this, though: https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=43&fullscreen&orgId=1&from=now-30d&to=now-1m [19:49:17] halAFK: how is the conference? [19:49:18] The response time data expires hella fast, but it's a key long-term metric.
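[Editor's note] The "model family" plus hidden-model idea discussed above can be made concrete with a small sketch. Everything here is hypothetical — the registry structure, the `hidden` flag, and the helper names are illustrative, not the actual ORES configuration format:

```python
# Hypothetical model registry: arbitrary per-wiki model names, grouped
# into families so clients can autodetect compatible models, with a
# "hidden" flag to keep experimental or legacy names out of the default
# "all models" response (the soft migration path mentioned above).
MODELS = {
    "articlequality":       {"family": "contentquality", "hidden": False},
    "itemquality":          {"family": "contentquality", "hidden": False},
    "damaging":             {"family": "editquality",    "hidden": False},
    "editquality-wordnet":  {"family": "editquality",    "hidden": True},
}

def models_in_family(family, include_hidden=False):
    """Let a client discover every model serving one use case."""
    return sorted(
        name for name, info in MODELS.items()
        if info["family"] == family and (include_hidden or not info["hidden"])
    )

def default_models():
    """What an 'all models' request would return: hidden models omitted."""
    return sorted(name for name, info in MODELS.items() if not info["hidden"])
```

Under this sketch, a client asking for the `editquality` family sees only `damaging`, while an autodiscovering client that opts in also sees the experimental `editquality-wordnet`.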
[19:51:42] (03CR) 10jenkins-bot: Clean up test annotations; coversNothing for integration test [extensions/JADE] - 10https://gerrit.wikimedia.org/r/468192 (owner: 10Awight) [19:55:52] (03Merged) 10jenkins-bot: More MoveHooks tests [extensions/JADE] - 10https://gerrit.wikimedia.org/r/468193 (owner: 10Awight) [19:59:49] (03CR) 10jenkins-bot: More MoveHooks tests [extensions/JADE] - 10https://gerrit.wikimedia.org/r/468193 (owner: 10Awight) [20:01:12] 10JADE, 10Scoring-platform-team (Current), 10Release-Engineering-Team, 10Continuous-Integration-Config, 10Patch-For-Review: Keep JADE compatible with MediaWiki LTS - https://phabricator.wikimedia.org/T207678 (10Legoktm) There's a similar task requesting PHP 5.x tests against older core versions for i18n... [20:36:55] 10ORES, 10Scoring-platform-team (Current), 10Growth-Team, 10MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (10awight) Let's explore an alternative to merging and renaming models: The conceptual similarity between `itemqualit... [20:37:48] 10Scoring-platform-team (Current), 10Documentation: Document JADE schema proposals and justifications - https://phabricator.wikimedia.org/T204250 (10awight) 05Open>03Resolved [20:38:14] https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1&from=now-3h&to=now [20:38:34] Failure rate went up to the threshold and then recovered :)) [20:39:43] Amir1: Do you know what happened beginning at 20:13? It seems dramatic, https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=10&fullscreen&orgId=1 [20:40:00] deployment I guess [20:40:13] Maybe you turned off precaching? [20:40:18] no no [20:40:21] let me check [20:40:51] 20:06 deployed started and 20:28 was finished [20:41:21] I guess changing task tracking explains the difference here. 
Normally, deployment only restarts one worker at a time so there's never a cutout like that [20:41:34] yeah [20:41:35] https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&refresh=1m&from=now-1h&to=now-1m&panelId=15&fullscreen [20:41:46] btw. Response time metrics fell down to way below the deployment [20:42:37] Cool, I'm very much looking forward to it! [20:42:45] That's expected, as we are removing lots of unneeded tasks from the task queue so workers are now utilized to do actual jobs instead of sitting there waiting for another task to finish [20:43:06] I guess what we're seeing in the current "all scores processed" graph is heavy traffic while we process the backlog of precache requests. [20:43:41] yess, that's a fantastic optimization you worked out [20:44:42] Dude, a ~40% improvement in response time 8D [20:45:52] 10ORES, 10Scoring-platform-team (Current), 10Patch-For-Review, 10User-Ladsgroup: Silence or address E_WOULDBLOCK warning - https://phabricator.wikimedia.org/T152012 (10Ladsgroup) Removing redundant lookup tasks basically cut our response time to half: {F26716106} median: 732ms vs. 1.25s 75%: 819ms vs. 1.3... [20:46:24] It's the precache graph, you can see a similar drop in external response time too \o/ [20:46:30] akosiaris: FYI https://phabricator.wikimedia.org/T152012#4687140 [21:05:48] (03CR) 10jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/JADE] - 10https://gerrit.wikimedia.org/r/469093 (owner: 10L10n-bot) [21:06:34] wikimedia/ores#1073 (celery4 - 0ee9fbe : Amir Sarabadani): The build passed. https://travis-ci.org/wikimedia/ores/builds/444702225 [21:10:15] (03CR) 10jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/ORES] - 10https://gerrit.wikimedia.org/r/469099 (owner: 10L10n-bot) [21:38:29] Amir1: nice!
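[Editor's note] The optimization credited above with halving response time — removing redundant lookup tasks so workers stop waiting on duplicates — can be illustrated with a hedged sketch. The in-memory set below stands in for whatever shared store the real system uses (e.g. Redis), and `enqueue_score`/`submit` are hypothetical names, not the actual ores/celery code:

```python
# Skip enqueueing a scoring task when an identical request is already
# in flight; the caller should attach to the pending result instead of
# spawning a duplicate task that ties up a worker.
_in_flight = set()  # stand-in for a shared task-tracking store

def enqueue_score(wiki, model, rev_id, submit):
    key = (wiki, model, rev_id)
    if key in _in_flight:
        return False        # duplicate -- reuse the pending task's result
    _in_flight.add(key)
    submit(key)             # e.g. a celery .delay(...) in the real system
    return True
```

With precache traffic, many near-simultaneous requests ask for the same (wiki, model, rev_id), so deduplicating at enqueue time frees workers to do actual scoring instead of blocking on each other.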
[21:39:56] ^^ It would increase the base cpu usage from 1% to 2% but I guess we can handle that [21:52:10] nice safety margin 0_0 [22:08:56] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:09:05] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:09:29] That's me ^ [22:09:35] PROBLEM - ORES web node labs ores-web-02 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:10:05] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 4.716 second response time [22:10:35] RECOVERY - ORES web node labs ores-web-02 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 0.542 second response time [22:11:15] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 442 bytes in 0.546 second response time [22:12:57] :D glad our icinga nerve is wired up again [22:16:26] so I can walk on its nerve and make it scream 24/7 [22:19:02] PROBLEM - check load on ORES-web01.Experimental is CRITICAL: connect to address 10.68.17.182 port 5666: Connection refusedconnect to host ores-web-01.ores.eqiad.wmflabs port 5666: Connection refused [22:19:07] PROBLEM - puppet on ORES-web02.Experimental is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:19:08] PROBLEM - check users on ORES-web01.Experimental is CRITICAL: connect to address 10.68.17.182 port 5666: Connection refusedconnect to host ores-web-01.ores.eqiad.wmflabs port 5666: Connection refused [22:19:57] PROBLEM - puppet on ORES-web01.Experimental is CRITICAL: connect to address 10.68.17.182 port 5666: Connection refusedconnect to host ores-web-01.ores.eqiad.wmflabs port 5666: Connection refused [22:20:07] PROBLEM - check disk on ORES-web01.Experimental is CRITICAL: connect to address 10.68.17.182 port 5666: Connection refusedconnect to host ores-web-01.ores.eqiad.wmflabs port 5666: Connection refused [22:25:04] btw. uwsgi has a memory leak :D [22:26:21] oh noes! [22:26:29] Is this new? [22:26:47] no, I'm pretty sure I've seen it [22:27:20] Our available memory is quite low on eqiad, we should dial back the number of workers [22:27:41] I know there was a sawtooth, but that doesn't = a memory leak [22:28:27] I think the sawtooth is worse after a deployment because service respawning is in sync [22:28:34] it's a sawtooth because we implemented an automatic restart after n times [22:28:44] I think it's both for celery and uwsgi [22:30:07] The sawtooth could also be Python's garbage collection. We should run a small experiment to check what is driving this, maybe by temporarily changing the # of service responses before restart? [22:30:39] At least, GC is what I'd been assuming until you brought up the excellent guess about service respawning [22:32:25] I've seen uwsgi respawning in ores logs [22:37:14] For sure that it happens, I'm just wondering whether it's the main factor in this memory profile [22:38:39] The first period is 39 minutes, FWIW. [22:39:43] I'm reading that Python GC is triggered at a certain count of object allocations, which would give it a lot of the same appearance as a sawtooth caused by service restarts; they're both proportional to the number of requests served.
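[Editor's note] The GC hypothesis above is easy to poke at: CPython's generational collector fires once net object allocations cross a threshold, so its cadence scales with request volume — much like restart-after-n-requests worker recycling does, which is why the two produce similar sawtooths. A quick check of the machinery:

```python
import gc

# Generation-0 collection runs once (allocations - deallocations) exceeds
# the first threshold, so collection frequency tracks how busy the
# process is rather than wall-clock time.
thresholds = gc.get_threshold()
print(thresholds)            # typically (700, 10, 10) on CPython

# A forced full collection returns the count of unreachable objects found;
# comparing memory before/after such a call is one way to separate
# "GC sawtooth" from a genuine leak that survives collection.
freed = gc.collect()
print(freed >= 0)
```

If memory drops after `gc.collect()` the sawtooth is mostly collector timing; if it doesn't, the growth is held by live references (a real leak), and worker recycling is papering over it.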
[22:40:13] Knowing how often a typical worker restarts would be a nice clue here [22:41:46] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:42:55] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 3.556 second response time [22:44:34] btw. Basically I want to bring down a worker but web can't handle it [22:44:43] I'm going to spin up another web node [22:45:42] That sounds fun [22:53:42] I cut the size of the queue to half and the size of the workers to one fourth. [22:56:27] harej: Can I help with the Extension:ORES install you were mentioning? [22:57:20] Amir1: That's a huge jump in efficiency! So it turns out, workers really were stalled waiting for dependency tasks to complete? [22:58:20] For a considerable part of it, yes, but for labs it can also be that the web nodes and the worker nodes are not scaled proportionally [23:16:05] (03PS3) 10Awight: Services tests [extensions/JADE] - 10https://gerrit.wikimedia.org/r/468280 [23:16:07] (03PS4) 10Awight: Tests for JudgmentTarget [extensions/JADE] - 10https://gerrit.wikimedia.org/r/468277 [23:16:09] (03PS7) 10Awight: Better signature for link table helper [extensions/JADE] - 10https://gerrit.wikimedia.org/r/468284 [23:16:11] (03PS3) 10Awight: Test fixes and coverage [extensions/JADE] - 10https://gerrit.wikimedia.org/r/468625 [23:16:13] (03PS19) 10Awight: Maintenance scripts for judgment indexes [extensions/JADE] - 10https://gerrit.wikimedia.org/r/466808 (https://phabricator.wikimedia.org/T202596) [23:16:15] (03PS15) 10Awight: Unit tests for JudgmentContent [extensions/JADE] - 10https://gerrit.wikimedia.org/r/468180 [23:17:07] RECOVERY - puppet on ORES-web02.Experimental is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:23:02] RECOVERY - check load on ORES-web01.Experimental is OK: OK - load average: 0.37, 0.19, 0.54 [23:23:08] RECOVERY - check users on ORES-web01.Experimental is OK: USERS
OK - 0 users currently logged in [23:23:56] RECOVERY - puppet on ORES-web01.Experimental is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:24:08] RECOVERY - check disk on ORES-web01.Experimental is OK: DISK OK