[06:33:15] PROBLEM - check load on ORES-web01.Experimental is CRITICAL: connect to address 172.16.3.131 port 5666: Connection refused; connect to host ores-web-01.ores.eqiad.wmflabs port 5666: Connection refused
[06:34:37] PROBLEM - check disk on ORES-web01.Experimental is CRITICAL: connect to address 172.16.3.131 port 5666: Connection refused; connect to host ores-web-01.ores.eqiad.wmflabs port 5666: Connection refused
[06:34:38] PROBLEM - check users on ORES-web01.Experimental is CRITICAL: connect to address 172.16.3.131 port 5666: Connection refused; connect to host ores-web-01.ores.eqiad.wmflabs port 5666: Connection refused
[06:35:40] PROBLEM - puppet on ORES-web01.Experimental is CRITICAL: connect to address 172.16.3.131 port 5666: Connection refused; connect to host ores-web-01.ores.eqiad.wmflabs port 5666: Connection refused
[06:53:16] RECOVERY - check load on ORES-web01.Experimental is OK: OK - load average: 0.11, 0.13, 0.26
[06:54:37] RECOVERY - check disk on ORES-web01.Experimental is OK: DISK OK
[06:54:38] RECOVERY - check users on ORES-web01.Experimental is OK: USERS OK - 1 users currently logged in
[06:55:22] RECOVERY - puppet on ORES-web01.Experimental is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:20:19] PROBLEM - ORES web node labs ores-web-01 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/ORES
[08:20:21] PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/ORES
[08:21:13] RECOVERY - ORES web node labs ores-web-01 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/ORES
[08:22:31] RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 981 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/ORES
[10:11:34] ORES, Scoring-platform-team, Operations, serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Joe) I think this is a reasonable explanation, but how would you suggest we should fix our monitoring?
[10:11:58] ORES, Scoring-platform-team, Operations, serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Joe) p:Triage→Normal a:Joe
[13:16:18] o/
[13:20:55] ORES, Scoring-platform-team, Operations, serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Halfak) I'm looking into what it would take to monitor a celery worker pool on a specific machine....
[14:50:30] ORES, Scoring-platform-team, Operations, serviceops: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Halfak) So, I've been trying to explore the behaviors of `celery -A ores_celery inspect ping` to see if...
[18:12:49] halfak: i'm back, still want to go over the ORES session orientation stuff?
[18:13:28] Hey! Yes. Let's do it.
[19:05:04] https://github.com/wikimedia/revscoring/pull/450
[19:05:06] accraze, ^
[19:05:30] My TODO is to try to implement the tree generation function that applies "list_of" to every leaf.
[19:05:57] wikimedia/revscoring#1691 (session_orientation - 2a68f8c : halfak): The build failed. https://travis-ci.org/wikimedia/revscoring/builds/575488406
[19:06:09] awesome i'll take a look
[19:07:12] brb changing locations
[19:07:38] accraze, not much code so it should be straightforward. You can see me implementing some of the meta-datasources that I've been thinking about like "first" and "last"
[19:07:46] * halfak runs away
[20:28:32] Thinking out loud in the chat. Feel free to ignore:
[20:28:50] So I'm converting all of our Dependents to list_of(Dependent)
[20:29:05] All dependents are a member of a hierarchical DependentSet
[20:29:52] We can do cool things like ask "is feature a in set X" or even "what features from feature list A are in set X"
[20:30:09] We do this by treating the tree as a flat set when asking these questions.
[20:31:14] So, when I'm working to recursively process a DependentSet, I could just ask for the flattened set and work from that.
[20:31:31] The tree structure of the DependentSet doesn't tell me anything about actual dependencies.
[20:31:38] OK I'm convinced.
[20:31:49] * halfak goes back to writing code.
[20:59:32] Bah! I need to put it back in the same tree though. Arg!
[21:07:43] accraze, what's the name of the "maintainer" I should add for revscoring in readthedocs?
[21:08:06] Aha! I added you but I imagine I need to add scoring-internal somehow.
[21:10:34] I ran into an annoying wall with dependency rewrites so I'm hoping to get that done before I disappear.
[21:24:04] I hope you can work with the maintainer status I gave you because I'm outta here.
[21:24:07] o/
[21:29:56] that should work fine, thanks halAFK!
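
Editor's note on the 20:28-20:59 discussion: the idea of answering membership questions against a flattened view while still rebuilding the same tree with "list_of" applied to every leaf can be sketched roughly as below. This is only an illustration under stated assumptions. The classes `Dependent` and `DependentSet`, the `list_of` wrapper, the `map_leaves` helper, and the feature names are all hypothetical stand-ins written for this note. They are not revscoring's actual classes or signatures.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Union


@dataclass(frozen=True)
class Dependent:
    """Hypothetical stand-in for a leaf (a feature or datasource)."""
    name: str


@dataclass
class DependentSet:
    """Hypothetical stand-in for a named, hierarchical set of dependents."""
    name: str
    children: List[Union["DependentSet", Dependent]] = field(default_factory=list)

    def flatten(self) -> Iterator[Dependent]:
        """Yield every leaf in order, ignoring the tree structure."""
        for child in self.children:
            if isinstance(child, DependentSet):
                yield from child.flatten()
            else:
                yield child

    def __contains__(self, dependent: Dependent) -> bool:
        """Membership is answered against the flattened view of the tree."""
        return any(leaf == dependent for leaf in self.flatten())


def list_of(dependent: Dependent) -> Dependent:
    """Hypothetical wrapper: here it only renames the leaf to mark it as a list."""
    return Dependent(f"list_of({dependent.name})")


def map_leaves(node: Union[DependentSet, Dependent],
               func: Callable[[Dependent], Dependent]):
    """Return a new tree of the same shape with func applied to every leaf."""
    if isinstance(node, DependentSet):
        return DependentSet(node.name,
                            [map_leaves(child, func) for child in node.children])
    return func(node)


# Made-up feature names, used only for illustration.
revision = DependentSet("revision", [
    Dependent("revision.text"),
    DependentSet("revision.parent", [Dependent("revision.parent.text")]),
])

# Every leaf is wrapped, but the nesting of the sets is preserved.
session = map_leaves(revision, list_of)
```

The flattened view is enough for questions like "is feature a in set X", but the rewrite itself has to recurse over the tree, which matches the 20:59:32 remark about needing to put everything back in the same tree.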