[09:24:21] halfak, it seems like the report needs some minor work
[09:24:38] I will jump into that :)
[14:00:33] o/ ToAruShiroiNeko
[14:00:38] what do you mean? What report?
[14:00:48] o/ schana
[14:00:54] How's the intertubes?
[14:00:56] hi halfak
[14:00:58] better
[14:01:17] it turned out to be a Germany-wide outage
[14:01:54] wow
[14:02:04] Welcome back to the internet, Germany.
[14:02:25] (at least for Kabel Deutschland)
[14:05:55] I saw you sent a message re. monitoring before signing off last night. Any more progress there?
[14:06:04] * halfak tries to find the gerrit link
[14:06:11] https://gerrit.wikimedia.org/r/#/c/296535/
[14:06:55] $web_nodes = [ 'ores-web-03', 'ores-web-05' ]
[14:06:58] :\
[14:06:59] akosiaris is right that it's not a clean solution, but it should work as a temporary one
[14:07:21] Why isn't this coming from hiera?
[14:08:02] the data would be duplicated either way, whether it comes from production hiera or an array in the icinga config
[14:08:23] In this case, there is already hiera config though.
[14:08:23] https://wikitech.wikimedia.org/wiki/Hiera:Ores
[14:08:36] but that's not available to the production icinga instance
[14:08:41] (from my understanding)
[14:09:03] Oh. Icinga can't be configured based on this hiera?
[14:09:18] correct
[14:09:32] Well... that's dumb. I'm sure there's a good reason. But *face palm*
[14:09:42] separation of production / labs
[14:09:59] So really, we need a labs icinga.
[14:10:12] * schana backs away slowly
[14:10:25] No worries. I would not suggest you spend time on that.
[14:10:31] :D
[14:10:36] I was just thinking that we need a "make labs great again" proposal
[14:10:57] * akosiaris appreciates the Trump reference
[14:11:05] even though I am not American
[14:11:13] graphite.wmflabs.org is barely supported. There's no eventlogging in labs. grafana.wmflabs.org is unsupported.
[14:11:22] EventBus doesn't work in labs.
[14:11:27] FWIW, labs had an icinga install. It was a mess
[14:11:49] akosiaris, is it necessarily a mess, or could we do better now that we've learned some lessons?
[14:12:41] there are some technical issues that can't really be solved, like the fact that labs can't really have puppet exported resources, which are used to propagate icinga configuration from the monitored nodes to the icinga server
[14:13:03] the grafana/graphite stuff can get better, of course
[14:13:34] eventbus/eventlogging-wise, I'd say I don't know. Eventbus, probably yes
[14:14:00] Hopefully re. eventbus. I think it's a little silly that we have a 100% public feed that is 100% private.
[14:14:17] I understand the kafka rewind privacy/security considerations.
[14:14:25] but they are minor and not that difficult to deal with.
[14:14:46] I just think that things @ Wikimedia should be built to be public first unless the difficulty is insurmountable.
[14:15:04] er, I definitely did not mean a mirror of the current eventbus install in labs. I meant an eventbus install of its own
[14:15:12] probably beta-related
[14:15:30] if there isn't something like that already
[14:15:53] the big problem with labs is that it is multi-tenant
[14:16:01] while production is effectively single-tenant
[14:16:22] so the best you can do is create a project in labs that tries to mimic production. That's beta
[14:16:42] with all its shortcomings and problems (fixable or otherwise)
[14:16:45] akosiaris, IMO, the real value comes from the mirror of production eventbus.
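(A rough sketch of the `$web_nodes` duplication problem discussed above: rather than hard-coding the node list a second time in the icinga config, a script could read it straight from the Hiera:Ores wiki page. Purely illustrative -- `?action=raw` returns the raw page text for ordinary wiki pages, but the hiera key name used here is a guess.)

```python
# Illustrative only: fetch the ORES web node list from the Hiera:Ores wiki page
# instead of hard-coding it a second time in the icinga config.
# Assumes the page body is plain YAML and that ?action=raw returns it verbatim.
import requests
import yaml

HIERA_ORES_RAW = "https://wikitech.wikimedia.org/wiki/Hiera:Ores?action=raw"

def get_web_nodes():
    resp = requests.get(HIERA_ORES_RAW, timeout=10)
    resp.raise_for_status()
    data = yaml.safe_load(resp.text)
    # Key name is a guess -- whatever hiera key actually lists the web nodes.
    return data.get("ores::web::nodes", [])

if __name__ == "__main__":
    print(get_web_nodes())
```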
[14:17:04] Tool devs can harvest the value from such a thing more effectively than WMF staff.
[14:17:11] Story of Wikimedia
[14:17:55] I've built a lot of systems that maintain sync with a MediaWiki install like English Wikipedia. Eventbus would make this work a lot easier.
[14:18:21] Further, I want to build secondary events of interest. E.g. [edits] --> [reverts] --> [reverts of goodfaith newcomers]
[14:18:24] I'll admit I have no idea
[14:18:58] I might be barely hacking it as a software engineer, but I'm much more competent as a socio-technical systems researcher. ;)
[14:20:42] * halfak wants to empower tool devs with a high-quality event service and interesting tools for secondary events.
[14:20:52] ORES will be really useful for interesting secondary events.
[14:21:10] [edits] --> [likely vandalism edits] is just the start
[14:23:22] * halfak calms down.
[14:23:23] OK. So schana, looking for tasks while you are blocked on the monitoring stuff?
[14:23:34] * halfak looks at ores-web-04
[14:23:44] It's not clear to me why that node got depooled
[14:24:04] well, I was thinking about the metrics collector stuff, but it seems I'll have to wait until after your refactor
[14:24:14] schana, nah. I'm not touching it.
[14:24:26] You can also write up a quick demo and start a discussion about it.
[14:24:47] Oh wait. I am going to remove "version" from every call to the metrics collector.
[14:24:52] But otherwise, it will stay as-is.
[14:25:40] I was just talking about implementing the task, not refactoring it
[14:25:59] Oh! The timeout error one?
[14:26:04] yeah
[14:27:16] related to your refactor, would it be possible for you to split out your changes into atomic commits?
[14:27:41] giant commits that change all the things are hard to review and/or reference later
[14:27:45] schana, I was thinking about that. I'm thinking that it will be easiest to do that after the fact.
[14:27:58] schana, I've never seen someone actually reference a commit.
[14:28:30] by reference I mean trying to figure out what's going on / the reasoning behind a certain area of code
[14:29:30] Hmm... Sure. I'm struggling to understand what you are imagining, but I believe you are imagining it!
[14:30:28] imagine coming across this code base 10 years down the road and trying to figure out why 'version' was removed from the metrics collector, with only a single commit message to guide you
[14:30:59] (I've been in similar situations on different projects)
[14:31:02] not fun
[14:31:09] Yeah. Usually the original reason doesn't matter to me as much as reasoning about what it means now.
[14:31:50] I'm still not arguing against doing what you need.
[14:32:45] for example, one of the things I encountered was some code used to bypass a hardware bug that was present in an old platform, but it was mixed into a giant commit
[14:33:48] Oh. Seems like that would be better left as an inline comment.
[14:34:10] or better yet, wrapped in an appropriately named function
[14:34:14] comments are not to be trusted
[14:34:16] Yeah.
[14:34:38] Seems like commit messages are a last resort then
[14:34:59] Well, I suppose that reading/understanding/testing the code is a last resort.
[14:36:04] I think the nuclear option is always a valid last resort :)
[14:36:36] halfak, they want us to fix our report
[14:36:41] there are a few minor things
[14:36:57] What report?
[14:39:07] IEG
[14:39:10] I got an email
[14:39:25] Oh man. That thing is old.
[14:39:29] Anything you need from me?
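(To make the [edits] --> [likely vandalism edits] idea above concrete, here is a minimal sketch of deriving a secondary event stream by scoring revisions against ORES. The https://ores.wikimedia.org scores endpoint and the response shape are assumptions about the public scores API, and the 0.8 threshold is arbitrary.)

```python
# Illustrative sketch of a "secondary event" filter: take edit events and emit
# only the ones ORES thinks are likely vandalism. The URL and response shape
# below are assumptions about the public scores API.
import requests

ORES_URL = "https://ores.wikimedia.org/v3/scores/{wiki}/"

def likely_vandalism(wiki, rev_ids, threshold=0.8):
    """Yield (rev_id, probability) for revisions scored as likely damaging."""
    params = {"models": "damaging", "revids": "|".join(str(r) for r in rev_ids)}
    resp = requests.get(ORES_URL.format(wiki=wiki), params=params, timeout=30)
    resp.raise_for_status()
    scores = resp.json()[wiki]["scores"]
    for rev_id, models in scores.items():
        score = models["damaging"].get("score")
        if score is None:  # scoring error for this revision; skip it
            continue
        p_damaging = score["probability"]["true"]
        if p_damaging >= threshold:
            yield int(rev_id), p_damaging

if __name__ == "__main__":
    for rev, p in likely_vandalism("enwiki", [123456789]):
        print(rev, p)
```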
[14:40:00] I don't think so
[14:40:06] but I wanted you to be aware
[14:40:15] I forwarded it to you just in case you want to keep an eye on it
[14:40:32] kk thanks
[14:40:44] it's really positive
[14:40:49] just a few loose ends
[14:41:27] schana, I just realized we didn't finish the thoughts re. tasks for you.
[14:41:48] I'm curious to see what you are thinking re. using `logging` for metrics collection.
[14:42:03] Would you be interested in making a quick demo of how that could work?
[14:42:31] maybe after implementing the timeout error handling logic
[14:42:49] OK well. Let's find something else. :)
[14:43:31] Want to stay in ORES? Or would you like to get your hands dirty in some JS for Wiki labels?
[14:43:35] (which, btw, there's already a built-in TimeoutError(OSError) in Python)
[14:43:50] schana, indeed there is.
[14:45:00] Seems like we should extend or re-use that.
[14:45:06] File a task!
[14:45:26] well, it's not really applicable to the exception being thrown
[14:45:37] (FWIW, it seems that every library -- even some of Python's standard libs -- has its own TimeoutError)
[14:45:57] Celery has one and concurrent.futures has one.
[14:46:10] yeah
[14:46:16] hooray for generic names
[14:50:08] So back to my task question, "Want to stay in ORES? Or would you like to get your hands dirty in some JS for Wiki labels?"
[14:50:54] preferably ORES
[14:51:09] Revision-Scoring-As-A-Service-Backlog, ORES: Testing multiple revisions in Swagger fails - https://phabricator.wikimedia.org/T132206#2421520 (Halfak)
[14:51:12] How about https://phabricator.wikimedia.org/T132206
[14:51:51] Or maybe https://phabricator.wikimedia.org/T137962
[14:52:18] Revision-Scoring-As-A-Service-Backlog, ORES, revscoring: Deploy new ORES with revscoring performance improvements - https://phabricator.wikimedia.org/T134784#2421521 (Halfak) This is done.
[14:52:24] Revision-Scoring-As-A-Service-Backlog, ORES, revscoring: Deploy new ORES with revscoring performance improvements - https://phabricator.wikimedia.org/T134784#2421522 (Halfak) Open→Resolved
[14:53:22] it seems like the second one should have some nginx solution that shouldn't involve any engineering work
[14:54:41] schana, I'm not sure how that'll work for tracking
[14:54:51] why would you want to track it?
[14:54:52] I look forward to discussing your proposed solution
[14:55:00] So that we know which IP/user-agent to block
[14:55:02] Or limit
[14:55:15] I'm fairly certain that can be handled internally by nginx
[14:55:23] Cool!
[14:56:56] although I'm wondering, what is the need for this? preventative? have we had DoS/DDoS attacks in the past (not directed at Wikipedia)?
[14:59:45] Metrics on who is using our service, and for prevention
[14:59:55] Yes, we have had some accidental DoS hits
[15:00:17] At some point we might want to limit requests that do not include an email address in the user-agent.
[15:01:22] that seems oddly specific
[15:01:44] Essentially we want anyone using the service to provide contact information.
[15:01:52] But I don't want it to magically not work if you don't
[15:02:16] So it would be great if we could have a ceiling on request rates for requests lacking an email address in the user-agent.
[15:02:21] E.g. return them a 503 error
[15:02:34] But if you provide contact information, no limit.
[15:02:57] This is a wish that I'm not sure we can fulfill without dumping in a bunch of effort.
[15:03:14] So in the meantime, I want to be able to track requests and block people as necessary.
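(A sketch of the "ceiling for requests without contact info" idea above -- not the actual ORES or nginx setup. It assumes a Flask-style app; the email regex, window, and limit are arbitrary, and the in-memory per-IP counter is per-process only, so a real deployment would lean on nginx's rate limiting or a shared store instead.)

```python
# Rough sketch: requests whose User-Agent contains an email address are never
# limited; anonymous-looking clients get a crude per-IP cap. Assumes a Flask
# app; the in-memory counter is per-process and only for illustration.
import re
import time
from collections import defaultdict
from flask import Flask, request, abort

app = Flask(__name__)

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
WINDOW = 60          # seconds
ANON_LIMIT = 100     # requests per window without contact info
hits = defaultdict(list)

@app.before_request
def soft_rate_limit():
    ua = request.headers.get("User-Agent", "")
    if EMAIL_RE.search(ua):
        return  # contact info provided: no ceiling
    now = time.time()
    recent = [t for t in hits[request.remote_addr] if now - t < WINDOW]
    recent.append(now)
    hits[request.remote_addr] = recent
    if len(recent) > ANON_LIMIT:
        # 503 as suggested in the log; 429 would be the more conventional code.
        abort(503)

@app.route("/")
def index():
    return "ok"
```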
[15:03:16] I think the more established way to do that is with either keys or access tokens
[15:03:26] schana, not at Wikimedia
[15:03:58] https://www.mediawiki.org/wiki/API:Main_page#Identifying_your_client
[15:06:19] IMO the whole access token pattern puts too much burden on the developer.
[15:08:09] I'll think about it some and add my thoughts to the phab card next week
[15:08:46] Cool.
[15:08:57] Shall I move it to "active" and assign it?
[15:09:05] I wouldn't yet
[15:09:21] I subscribed myself, though
[15:09:28] OK, fine by me
[15:09:43] well, I'll see you Tuesday
[15:09:54] have a nice Friday/weekend
[15:10:01] You too!
[15:10:02] o/
[17:40:47] Revision-Scoring-As-A-Service, ORES, Patch-For-Review: [spike] Find out if we can get health check warnings when a web node goes down from the load balancer's perspective - https://phabricator.wikimedia.org/T134782#2422115 (Dzahn) merged and watched on neon. has been added to icinga https://icinga.w...
[18:33:11] Looks like the location block may need "^~" before the path
[19:20:25] My headache is hitting me again
[19:20:29] back in 30
[19:57:36] back!
[20:02:26] I hope your headache is gone, halfak
[20:02:46] Mostly, it is :)
[20:03:08] I got a new pair of glasses and I guess they are sitting a little funny.
[20:03:22] Vision seems awesome, but 4-5 hours of staring at the screen --> headache
[20:03:35] --> lay down for a bit --> feel better --> (repeat)
[20:41:52] OK. Fun story. I've been debugging a weird pickling issue for 4 hours.
[20:42:22] I can't reproduce the issue outside of WSGI, so I'm slowly building up a mock WSGI environment to figure out what is causing the problem.
[20:42:48] Regretfully, the pickling error I am getting will *not* say where the error actually occurs, so I'm not sure what is failing to get pickled.
[20:55:47] oh dear
[20:55:55] well, I directly benefit from Brexit, it turns out
[20:57:47] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES, ORES, Epic: [Epic] Complete outstanding tasks for ORES extension deployments - https://phabricator.wikimedia.org/T138251#2422667 (Danny_B)
[21:25:54] Amir1, halfak, ToAruShiroiNeko, labeling on plwiki is 100% done
[21:26:01] \o/
[21:26:25] TarLocesilion: nice, I'm going to make the damaging model
[21:26:34] tomorrow or the day after
[21:26:40] http://labels.wmflabs.org/campaigns/plwiki/?campaigns=stats
[21:26:41] ok, great
[21:26:50] Revision-Scoring-As-A-Service-Backlog, Wikilabels, rsaas-editquality: Complete plwiki edit quality campaign - https://phabricator.wikimedia.org/T130269#2422747 (Halfak) Done! http://labels.wmflabs.org/campaigns/plwiki/?campaigns=stats
[21:27:13] Revision-Scoring-As-A-Service, Wikilabels, rsaas-editquality: Complete plwiki edit quality campaign - https://phabricator.wikimedia.org/T130269#2422749 (Halfak)
[21:27:44] Revision-Scoring-As-A-Service, rsaas-editquality: Build edit quality models for plwiki - https://phabricator.wikimedia.org/T139207#2422752 (Halfak)
[21:27:58] Revision-Scoring-As-A-Service, Wikilabels, rsaas-editquality: Complete plwiki edit quality campaign - https://phabricator.wikimedia.org/T130269#2131369 (Halfak)
[21:28:37] TarLocesilion, I'm going to assign "completing" the labeling to you so we remember to give you credit when the task is resolved.
[21:28:56] It will be sitting in our "Done" column until our sync meeting on Tuesday.
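(On the pickling failure halfak describes above at 20:41 -- where the error doesn't say which object is to blame -- a small stdlib-only probe like the following can help narrow it down. Illustrative only: it just walks an object's attributes a few levels deep and reports whatever refuses to pickle; it doesn't descend into containers.)

```python
# Stdlib-only helper for a pickling failure with an unhelpful traceback:
# report which attributes of an object fail to pickle.
import pickle

def find_unpicklable(obj, path="obj", depth=3):
    try:
        pickle.dumps(obj)
        return  # this object pickles fine
    except Exception as exc:
        print(f"{path}: {type(obj).__name__} fails to pickle ({exc})")
    if depth <= 0 or not hasattr(obj, "__dict__"):
        return
    for name, value in vars(obj).items():
        find_unpicklable(value, f"{path}.{name}", depth - 1)

# Usage: find_unpicklable(scoring_system)  # or whatever object won't pickle
```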
[21:29:02] Revision-Scoring-As-A-Service, Wikilabels, rsaas-editquality: Complete plwiki edit quality campaign - https://phabricator.wikimedia.org/T130269#2131369 (Halfak) a: tarlocesilion
[21:29:08] Feel free to ignore :)
[21:29:12] And thanks for letting us know :)
[21:29:43] "feel free to ignore" <3
[21:30:27] I'm glad a few other users joined the task force, we'll be waiting for new goals
[21:31:11] TarLocesilion, how do you feel about an article quality prediction model?
[21:35:30] we don't have any general (cross-wikiproject) quality scale (except for a DYK-like class, and of course FA and GA), so I think we could establish one first, but on the other hand, it might... be too difficult...?
[21:36:16] I mean, it wouldn't be really helpful to use the A-B-C scale
[21:36:47] TarLocesilion, if you could gather a small group of interested Wikipedians to label ~3000 articles, we could apply it to the entire wiki.
[21:36:56] But if that doesn't sound too valuable, no worries.
[21:37:15] In the meantime, we'll try to get the models working and we'll need your feedback about whether or not they work in practice.
[21:37:23] We'll, of course, be checking the fitness stats.
[21:37:36] maybe this way: what problem does the quality prediction model solve?
[21:37:38] But real-life experience with them will tell the more important tale.
[21:37:54] TarLocesilion, WikiEd uses it to find drafts that are ready for mainspace.
[21:38:12] EllenCT has a bot that finds popular but low-quality articles to draw attention to them.
[21:38:26] We can use it for basic stats when learning how the wiki is developing.
[21:39:00] e.g. we use the model in English Wikipedia to look at what types of editing dynamics tend to lead towards positive quality changes most efficiently.
[21:39:11] It would be good to know how those results compare to plwiki.
[21:39:27] yes, we could try.
[21:39:46] I'll ask my colleagues
[21:39:47] It's a lot of work to assess articles like this, so I understand if it ends up not being worth it.
[21:39:53] We can always keep it on the back burner :)
[21:40:07] halfak: I was thinking we should try building a monitoring system to alert us if a campaign passes 90% of its tasks
[21:40:19] Oh! One more thing, the Discovery team is looking to experiment with using article quality predictions in search results.
[21:40:30] Amir1, sounds like a good idea to me.
[21:40:38] We might even have a sort of health score for campaigns.
[21:40:46] E.g. if there's activity at all, it's healthy.
[21:40:55] If no activity for a week, it's dying
[21:48:25] Revision-Scoring-As-A-Service, rsaas-editquality: Build edit quality models for plwiki - https://phabricator.wikimedia.org/T139207#2422793 (Ladsgroup) a: Ladsgroup
[21:51:03] Revision-Scoring-As-A-Service-Backlog, revscoring: Train/test copyvio detection model - https://phabricator.wikimedia.org/T131481#2168286 (kaldari) The new [[ https://tools.wmflabs.org/copypatrol | CopyPatrol ]] tool is slowly building up a dataset of manually confirmed copyright violations. If this is u...
[21:51:03] [3] https://meta.wikimedia.org/wiki/https://tools.wmflabs.org/copypatrol
[21:51:47] Amir1, I have been exercising the refactor of score processing a lot. I actually have the applications running again, but I'm working on weird pickling issues.
[21:52:05] I'm probably not going to do the hack session tomorrow, but I plan to get back into that soon.
[21:52:59] I'm heading out now. If everything goes as planned, I'll have a full PR ready for the refactor by Tuesday EOD.
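(A sketch of Amir1's "alarm when a campaign passes 90% of its tasks" idea above. The stats URL is the one quoted in the log; the JSON layout is an assumption -- all the log shows is that it exposes "labels" and "tasks" counts per campaign.)

```python
# Sketch of a campaign-completion alarm. The "campaigns" and "name" keys are
# assumptions about the stats endpoint's JSON; only "labels" and "tasks"
# appear in the log itself.
import requests

STATS_URL = "http://labels.wmflabs.org/campaigns/{wiki}/?campaigns=stats"

def nearly_done_campaigns(wiki, threshold=0.9):
    stats = requests.get(STATS_URL.format(wiki=wiki), timeout=10).json()
    for campaign in stats.get("campaigns", []):   # assumed key
        tasks = campaign.get("tasks") or 0
        labels = campaign.get("labels") or 0
        if tasks and labels / tasks >= threshold:
            yield campaign.get("name"), labels, tasks

if __name__ == "__main__":
    for name, labels, tasks in nearly_done_campaigns("plwiki"):
        print(f"{name}: {labels}/{tasks} labeled -- time to plan the next campaign")
```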
[21:53:40] Amir1, FYI: https://phabricator.wikimedia.org/T139177
[21:53:52] The refactor will at least partially address this.
[21:54:04] We should cut our memory usage nearly in half.
[21:54:19] nice
[21:54:24] I think we should also explore switching away from RF models when we can get nearly the same fitness with GB models.
[21:54:26] looking forward to reading and reviewing
[21:54:44] I'm not sure how the memory footprint of RF vs. GB works out.
[21:54:55] hmm, that's possible, it won't be hard
[21:54:57] Saving a file on disk != memory footprint, but probably related.
[21:55:08] we can run some tests later?
[22:01:33] Yeah. I think that's a good idea.
[22:01:45] OK. Now I'm out of here.
[22:01:53] Oh... let me push more changes for the refactor.
[22:02:07] halfak, what classes can be set in the quality prediction model? do you always use the same FA-GA-A-B-C?
[22:03:02] TarLocesilion, any classes you want are fine
[22:03:08] They don't even need to be ordinal
[22:03:56] OK... changes pushed.
[22:03:57] o/
[22:04:15] can you show me different examples? how does it work? I'd like to know what I'm presenting to people :)
[22:09:13] Revision-Scoring-As-A-Service, ORES, Patch-For-Review: [spike] Find out if we can get health check warnings when a web node goes down from the load balancer's perspective - https://phabricator.wikimedia.org/T134782#2422818 (Dzahn) so the full command we are actually running here, after unwrapping it,...
[22:14:40] "labels": 5016, "tasks": 5000
[22:15:11] how could that happen :D
[22:20:45] TarLocesilion: it was a known bug; we fixed it, so it won't happen for new campaigns
[22:20:55] but not for old ones
[22:22:28] and about classes, you can think of any classes, but they should be able to cover all of Wikipedia (or at least most of it). E.g. dividing into "stub" and "FA" is not enough because we have lots of articles that are in between
[22:23:00] so, I would suggest "stub", "c", "b", "GA", "FA"
[22:23:24] (we can do "stub" and "FA" only, but it won't serve any meaningful purpose)
[22:30:57] Revision-Scoring-As-A-Service-Backlog, Community-Tech, CopyVio-tools: CopyPatrol should show ORES scores - https://phabricator.wikimedia.org/T139009#2422875 (kaldari)
[22:35:37] going to sleep
[22:35:39] o/
[23:45:47] Revision-Scoring-As-A-Service, ORES, Patch-For-Review: [spike] Find out if we can get health check warnings when a web node goes down from the load balancer's perspective - https://phabricator.wikimedia.org/T134782#2423109 (schana) I think nginx needs "^~" before the node location to properly match.
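(On the "we can run some tests later?" exchange above about RF vs. GB footprint: a quick, purely illustrative comparison of pickled model size on toy data. As the log itself notes, pickled size on disk is only a rough proxy for memory footprint, and the model settings here are arbitrary rather than the ones revscoring actually uses.)

```python
# Compare pickled size (a rough proxy, not true resident memory) of a
# RandomForest vs. a GradientBoosting classifier trained on toy data.
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=40, random_state=0)

for Model in (RandomForestClassifier, GradientBoostingClassifier):
    clf = Model(n_estimators=100, random_state=0).fit(X, y)
    size = len(pickle.dumps(clf))
    print(f"{Model.__name__}: {size / 1024:.0f} KiB pickled")
```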