[10:25:04] o/
[10:29:26] Hey Amir1
[10:30:16] I was planning to come to the WMDE offices today but I feel pretty beat up from the travel, so I think I'll aim for a nap first. But! I have to eat something for lunch. Interested in grabbing something with me?
[10:30:26] halfak: hey, on my way to the office
[10:30:34] Sure. When?
[10:30:34] Breakfast/lunch time?
[10:30:39] Now-ish.
[10:31:05] Oh, I need to be at the quarterly meeting now
[10:31:16] Gotcha. I could wait an hour.
[10:32:36] halfak: it's until 1400 :/
[10:33:15] Boo. OK, I'm going to just grab some food then.
[10:33:20] Any recommendations?
[10:33:35] I was hoping to just grab a quick bite.
[10:34:15] halfak: the Mall of Berlin is close by. Its third floor has everything
[10:34:28] Sounds good. Will scope it out.
[10:35:02] It has nice currywurst
[11:03:46] 10Scoring-platform-team: Read through teahouse literature to find exact outcome metric. - https://phabricator.wikimedia.org/T209652 (10notconfusing) + past-paper chi2 on whether the user made an edit after X period - groups: whether or not they were invited. + H1 so comparison be the number of people surviving...
[11:04:34] 10Scoring-platform-team (Current): Read through teahouse literature to find exact outcome metric. - https://phabricator.wikimedia.org/T209652 (10notconfusing) p:05Normal>03Triage
[11:07:04] 10Scoring-platform-team (Current): Incorporate newcomerquality model into a python package - training side - https://phabricator.wikimedia.org/T208365 (10notconfusing)
[11:08:19] 10Scoring-platform-team (Current): Incorporate newcomerquality model into a python package - training side - https://phabricator.wikimedia.org/T208365 (10notconfusing) done - https://github.com/notconfusing/newcomerquality/commit/e888375e69a2ac19f9cf99317d3533439f3544df
[11:12:54] 10Scoring-platform-team (Current): create prediction function for newcomerquality package - https://phabricator.wikimedia.org/T211192 (10notconfusing)
[13:12:28] akosiaris: hey, for when you're around. Do you think we can do this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/477302
[13:31:23] 10ORES, 10Scoring-platform-team, 10Operations, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Backlog): Blubber should be able to make multi docker files per repo - https://phabricator.wikimedia.org/T210267 (10Ladsgroup) p:05Normal>03Triage
[13:32:12] 10ORES, 10Scoring-platform-team, 10Operations, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Backlog): Blubber should be able to make multi docker files per repo - https://phabricator.wikimedia.org/T210267 (10Ladsgroup) p:05Triage>03Normal I didn't change the priority.
[13:40:15] Amir1: yeah I think we can. I've already tested it on ores1001 and it seems harmless
[13:40:29] I think we can roll it out gradually to all machines
[13:44:28] Sure. Thanks. Maybe we can check the memory size on the redis nodes. It would probably free up some space there
[14:04:10] akosiaris: tell me when you want to proceed. Thanks!
[14:04:19] I will check things here and there
[14:05:36] Amir1: it's capped at 6G and will probably fill up just as fast. But it will mean more objects in it https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=39&fullscreen&orgId=1
[14:06:29] Amir1: this however is not good https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=35&fullscreen&orgId=1
[14:06:42] an increase in commands of 3x?
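(For reference: the kind of redis check being discussed here, i.e. memory usage and per-command rates on the cache nodes, could be done directly from a redis-py client. A minimal sketch follows; the host name is a placeholder, not the real ORES cache host, and the Grafana panels above are the authoritative source for these numbers.)

```python
# Hypothetical sketch: inspect memory usage and per-command call counts on a
# redis cache node, roughly what the Grafana panels linked above graph.
# Assumes the redis-py client; host/port are placeholders.
import redis

r = redis.Redis(host="oresrdb-example.local", port=6379)

mem = r.info("memory")
print("used memory:", mem["used_memory_human"])

# Per-command call counts (GET, SETEX, MULTI, DEL, ...), useful for spotting
# a sudden jump like the ~3x increase mentioned above.
for cmd, stats in sorted(r.info("commandstats").items()):
    print(cmd, stats["calls"])
```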
[14:06:47] hmm yeah, probably a better cache hit ratio I guess, but that might not be a super big jump
[14:07:47] multi, setex, del, get commands
[14:07:52] that should not happen. Especially since we just did one node
[14:07:57] have all multiplied since 13:40
[14:08:26] I'll shut down the worker on that node just to make sure it's that
[14:08:35] akosiaris: it's external
[14:08:43] the requests have been multiplied
[14:08:56] 6k per min
[14:08:57] Jesus
[14:08:57] it times awfully well with my change
[14:09:13] big coincidence if so
[14:09:37] interesting
[14:09:50] let me check
[14:09:54] I'll just stop celery on ores1001 just to rule it out
[14:10:05] if it's external requests we should not see a change
[14:10:18] just external requests
[14:10:53] https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=1&fullscreen&orgId=1
[14:11:08] Sure. Good idea
[14:18:31] Amir1: https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=31&fullscreen&orgId=1 this has suspiciously fallen back to previous levels
[14:19:12] oh
[14:19:13] I can restart the worker and I am now pretty confident we will see the same
[14:19:56] The external requests died down
[14:20:11] That's super weird
[14:20:58] there have also been no score errors since 10 mins ago
[14:21:27] want me to start the worker again so we witness it once more?
[14:21:56] IDK
[14:23:05] I was thinking that because they have different serialization they might end up not picking it up, but that's very unlikely
[14:28:51] 10ORES, 10Scoring-platform-team: Tuning broken in some repos, needs revscoring 2 update - https://phabricator.wikimedia.org/T184727 (10Ladsgroup) p:05Triage>03Normal
[14:29:05] 10Scoring-platform-team, 10User-Ladsgroup: Audit deployed editquality models and figure out why if the models are bad - https://phabricator.wikimedia.org/T194742 (10Ladsgroup) p:05Triage>03Low
[14:29:17] 10ORES, 10Scoring-platform-team: ORES command line service sometimes hangs - https://phabricator.wikimedia.org/T205909 (10Ladsgroup) p:05Triage>03Normal
[14:29:26] 10MediaWiki-extensions-ORES, 10Scoring-platform-team, 10Growth-Team, 10MediaWiki-extensions-PageCuration: ORES "is model really enabled?" PHP API - https://phabricator.wikimedia.org/T205323 (10Ladsgroup) p:05Triage>03Low
[14:30:47] 10Scoring-platform-team, 10Wikilabels: Wikilabels needs manual reboot when DB connection is broken - https://phabricator.wikimedia.org/T209604 (10Ladsgroup) p:05Triage>03Normal
[14:31:07] 10ORES, 10Scoring-platform-team: Read timeout from enwiki when requesting non-existent revision - https://phabricator.wikimedia.org/T204984 (10Ladsgroup) p:05Triage>03Normal
[14:31:33] 10Scoring-platform-team, 10editquality-modeling, 10artificial-intelligence: Simplify and modularize the Makefile template - https://phabricator.wikimedia.org/T190968 (10Ladsgroup) p:05Triage>03Low
[14:54:52] akosiaris: I found out why it's happening
[14:54:53] https://logstash.wikimedia.org/goto/7bd46fd52b86553f93f1c1e011ee7f2d
[14:55:43] basically, even though we set it to accept both json and pickle, it doesn't. It only accepts the one it's set to
[14:56:28] that means to properly test it we need to deploy it on all nodes and restart them at the same time
[14:56:39] akosiaris: one thing we can do is to do it for codfw first
[14:57:26] Amir1: I am in the middle of an unrelated migration. I'll ping you later
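(For context on the serializer finding above: the Celery settings involved look roughly like the sketch below. This is a generic illustration, not the actual ORES configuration. The point is that accept_content governs what a worker or result consumer will deserialize, while the serializer settings govern what gets written, so producer and consumer sides have to be switched together.)

```python
# Generic Celery configuration sketch (not the actual ORES config) showing
# the serialization settings discussed above. Broker/backend URLs are
# placeholders.
from celery import Celery

app = Celery(
    "ores_example",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

app.conf.update(
    task_serializer="json",        # how the web (uwsgi) side encodes jobs
    result_serializer="json",      # how workers encode results
    accept_content=["json", "pickle"],         # what consumers will accept
    result_accept_content=["json", "pickle"],  # result side (Celery 4.3+)
)
```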
[15:02:51] Technical Advice IRC meeting starting in 60 minutes in channel #wikimedia-tech, hosts: @CFisch_WMDE - all questions welcome, more info: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[15:05:16] sure
[15:21:45] 10ORES, 10Scoring-platform-team: Exception killing threads in ORES celery workers - https://phabricator.wikimedia.org/T182862 (10Ladsgroup) p:05Normal>03Lowest I tested this on localhost. It doesn't happen anymore when a timeout happens. (Put a time.sleep() in the timeout function in the "with" part. The sigte...
[15:30:53] Amir1: I am around now
[15:31:06] what? that would mean a pretty extended outage
[15:31:21] and without even knowing if it works
[15:33:22] If we restart everything at the same time, it should not be that problematic, but I trust you
[15:40:23] tbh I am fighting to understand just what happened
[15:40:42] so we change the celery worker (and only that) to only parse json-encoded jobs, right?
[15:41:00] and also serialize the results and put them back in the result backend as json
[15:41:17] so, questions: doesn't uwsgi also require some changes to inject the job?
[15:41:38] * akosiaris doesn't expect answers btw, just doing a brain dump
[15:42:12] these are celery settings. The problem is that uwsgi expects pickle when reading the result (if it's not ores1001), but if the score is processed on ores1001, it's json
[15:42:55] this setting is the bridge between uwsgi and celery
[15:43:28] wait, I did not restart uwsgi when changing that setting
[15:43:33] I only restarted celery
[15:44:27] if you don't, it would increase the error rate, because uwsgi everywhere wants pickle but ores1001 is giving out json
[15:45:07] but even if you do it, it would still error, because most of them would be processed somewhere else
[15:52:20] Technical Advice IRC meeting starting in 10 minutes in channel #wikimedia-tech, hosts: @CFisch_WMDE - all questions welcome, more info: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[15:57:18] akosiaris: Does this make sense to you? ^
[15:57:42] not particularly
[15:58:02] I still haven't figured out why the number of redis commands tripled
[15:58:19] I can understand uwsgi croaking and dying/ignoring the json results
[16:01:35] akosiaris: lots of clients (including cp, mediawiki) retry if they error out
[16:01:49] the number of external requests went up because of that
[16:04:21] cp/mediawiki is not external however
[16:04:36] or by external you mean != precached?
[16:05:35] yup
[16:08:02] akosiaris: we can also compress the celery results; that would increase our space so much. What do you think?
[16:08:33] space?
[16:08:47] we aren't really constrained by space, are we?
[16:09:07] well I mean on redis hosts
[16:09:07] by space I mean memory, as it's loaded up in memory
[16:09:21] we are currently at 6GB, we can increase it
[16:25:46] akosiaris: it would increase our cache hit rate. How much is the upper limit?
[16:26:56] 75%?
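(On the compression idea raised just above: Celery exposes result_compression and task_compression settings for this, so "compress the celery results" could look roughly like the sketch below. Again a generic illustration rather than the ORES config; how much memory it saves depends on how compressible the score blobs are.)

```python
# Sketch of compressing Celery results to save memory in the redis result
# backend. result_compression/task_compression are standard Celery settings;
# the app name and backend URL are placeholders.
from celery import Celery

app = Celery("ores_example", backend="redis://localhost:6379/1")

app.conf.update(
    result_serializer="json",
    result_compression="gzip",   # compress score results before storing
    task_compression="gzip",     # optionally compress job messages too
)
```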
[16:27:07] that's not necessarily the case btw
[16:27:19] I am under the assumption that it would increase the cache hit ratio
[16:27:33] it might, or it might not, depending on the requests
[16:27:59] or it might need excessive memory increases for minimal cache hit ratio increases
[16:28:07] the law of diminishing returns
[16:28:44] yeah, that's true
[16:29:11] but anyway I am not against that
[16:29:14] and it's quite easy
[16:29:29] we can bump it to, say, 8G relatively easily and see what that gives us
[16:31:08] sure
[17:43:01] 10Scoring-platform-team, 10Operations, 10Release-Engineering-Team (Watching / External): Contact number of some WMDE staff should be available to SRE/RelEng - https://phabricator.wikimedia.org/T210721 (10greg) Let me know if there's anything I can do, for now I'll just watch and respond as needed :)
[22:02:50] Technical Advice IRC meeting starting in 60 minutes in channel #wikimedia-tech, hosts: @tgr & @nuria - all questions welcome, more info: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[22:52:17] Technical Advice IRC meeting starting in 10 minutes in channel #wikimedia-tech, hosts: @tgr & @nuria - all questions welcome, more info: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
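(Relating to the 6G to 8G maxmemory bump discussed at 16:29: in production this would go through puppet rather than a live CONFIG SET, but a hypothetical redis-py snippet for checking and raising the cap would look like this. Host and port are placeholders.)

```python
# Hypothetical sketch of inspecting and bumping the redis maxmemory cap
# discussed above. In practice the ORES redis hosts get this via puppet;
# host/port here are placeholders.
import redis

r = redis.Redis(host="oresrdb-example.local", port=6379)

print("current maxmemory:", r.config_get("maxmemory")["maxmemory"])
r.config_set("maxmemory", 8 * 1024 ** 3)  # 8 GiB
print("new maxmemory:", r.config_get("maxmemory")["maxmemory"])
```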