[00:04:55] (PM spamflood ensues)
[00:25:57] one problem I'm going over in my mind is, who is actually responsible for building out JADE use cases? as in, who's going to write the code?
[00:26:48] I understand that people randomly showing up to write gadgets is a thing that happens sometimes, but I'm hesitant about staking the project's success on that.
[00:30:14] That's... quite a serious question.
[00:30:29] It's almost like we need a $2M grant?
[00:30:35] ;-)
[00:40:05] (CR) Awight: "> What I want to say is basically any query to these tables can be rewritten to query just the page table (with constructing the page titl" [extensions/JADE] - https://gerrit.wikimedia.org/r/456078 (https://phabricator.wikimedia.org/T203037) (owner: Awight)
[06:14:49] (CR) Krinkle: [C: -1] Introduce ext.ores.api (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/459549 (https://phabricator.wikimedia.org/T201691) (owner: Ladsgroup)
[06:18:31] (CR) Krinkle: [C: -1] Introduce ext.ores.api (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/459549 (https://phabricator.wikimedia.org/T201691) (owner: Ladsgroup)
[06:21:45] (CR) Krinkle: [C: -1] Introduce ext.ores.api (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/459549 (https://phabricator.wikimedia.org/T201691) (owner: Ladsgroup)
[06:25:24] (CR) Krinkle: [C: -1] Introduce ext.ores.api (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/459549 (https://phabricator.wikimedia.org/T201691) (owner: Ladsgroup)
[09:54:29] o/
[12:38:51] Scoring-platform-team (Current), ORES, WMF-JobQueue, Regression, and 2 others: Failed executing job: ORESFetchScoreJob - https://phabricator.wikimedia.org/T204753 (Ladsgroup) It seems strange because grafana says we didn't have any overload error except on spike (That was because of a deployment,...
[13:43:43] afk for lunch and stuff, will be back before the SoS
[13:43:53] write things here if you want me to report
[14:02:23] Technical Advice IRC meeting starting in 60 minutes in channel #wikimedia-tech, hosts: @addshore & @Pablo_WMDE - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[14:36:13] Scoring-platform-team (Current), ORES, WMF-JobQueue, Regression, and 2 others: Failed executing job: ORESFetchScoreJob - https://phabricator.wikimedia.org/T204753 (awight) This glitch predates PoolCounter by a few days, luckily. One action item we should take is to make the "scores errored" grap...
[14:52:16] Technical Advice IRC meeting starting in 10 minutes in channel #wikimedia-tech, hosts: @addshore & @Pablo_WMDE - all questions welcome, more infos: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[14:56:30] Scoring-platform-team (Current), ORES, WMF-JobQueue, Regression, and 2 others: Failed executing job: ORESFetchScoreJob - https://phabricator.wikimedia.org/T204753 (Ladsgroup) I reviewed all graphs and can't find what you're saying. The job failure rate is around 0.4% and it never was zero: https:...
[14:58:09] back now
[15:05:17] I'm going to hammer the wmflabs setup for tests
[15:23:07] pool counter? cool
[15:23:20] Realizing that I don't have an admin login for grafana at the moment... the hecl.
[15:23:25] *k
[15:24:13] Amir1: hey lmk if you have any pointers to Wikidata getting rid of their link tables.
[15:35:09] (got into grafana)
[15:39:07] Amir1: You're editing grafana too?
[15:39:21] awight: I'm at SoS
[15:39:34] weird, I wonder then...
[15:40:01] I've updated the error graph so it works on long timescales
[15:40:22] you can see that errors rarely correspond to an overload spike.
[15:45:35] Scoring-platform-team (Current), ORES, WMF-JobQueue, Regression, and 2 others: Failed executing job: ORESFetchScoreJob - https://phabricator.wikimedia.org/T204753 (awight) I made an adjustment to the graph, and it's now clear that there was no change in the overall error rate around switchover, i...
[16:52:10] I can't join the social time today, I have too many things to do
[17:06:01] hal
[17:06:05] No halfak?!
[17:06:25] guillom: He's OoO until next week. Anything I might be able to help with?
[17:07:18] awight: I was hoping to ask him if there was one talk/video of his that he liked to use as a pointer for people who want to get an overview of ORES and rev scoring in general.
[17:07:34] aha lemme look
[17:07:50] i.e. a gentle, general overview, not necessarily a super-deep research one
[17:08:28] guillom: Here's a recent one, https://www.youtube.com/watch?v=8-SfvKl1e3M
[17:08:36] * guillom appreciates awight's help, and was otherwise going to beging a dance-and-chant using swearwords to summon the halfak.
[17:09:03] begin*
[17:09:11] Looking at the video now; thank you!
[17:09:52] :-) You can still do the dance
[17:26:14] https://etherpad.wikimedia.org/p/scoring_201819_q2_goals
[17:26:41] Scoring-platform-team (Current), DBA, JADE, Operations, User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (awight) @jcrespo When you have a minute, I'd like to hear your opinion on calculated field joins, e.g. {P7570} I don'...
[17:28:44] harej: right on!
[17:46:31] I'm playing with the ores-staging node
[17:46:32] in labs
[17:46:44] in case something screams about it
[17:57:04] (PS1) Anomie: Adjust for core change I4764c1c78 [extensions/ORES] - https://gerrit.wikimedia.org/r/461441 (https://phabricator.wikimedia.org/T204669)
[18:31:14] awight: If you can take a look at the fix I made for the stupidest mistake I have ever made: https://github.com/wikimedia/ores/pull/262
[18:31:31] I hope I can deploy it to prod tonight
[18:31:45] fun fact: debugging this took a whole day of mine
[18:33:40] Amir1: Sorry that my code review was useless!
[18:34:12] that was pretty sneaky
[18:36:16] Sure but it was like a 5-line change in that file, I could have... read it more carefully.
[18:36:32] Anyway, nothing burned down :-)
[18:41:28] Scoring-platform-team (Current), ORES, User-Ladsgroup: ORES doesn't block hammering IPs - https://phabricator.wikimedia.org/T204862 (Ladsgroup)
[18:42:13] Scoring-platform-team (Current), ORES, User-Ladsgroup: ORES doesn't block hammering IPs - https://phabricator.wikimedia.org/T204862 (Ladsgroup) The reason was that it just locks the IP and then immediately releases it: https://github.com/wikimedia/ores/pull/262
[18:43:49] wikimedia/ores#997 (fix_poolcounter - 87f2d54 : Amir Sarabadani): The build failed. https://travis-ci.org/wikimedia/ores/builds/430680237
[18:44:41] random failure ^
[18:45:10] :)
[18:48:57] wikimedia/ores#997 (fix_poolcounter - 87f2d54 : Amir Sarabadani): The build failed. 
https://travis-ci.org/wikimedia/ores/builds/430680237
[18:52:42] oh I see why it's failing, it's because the branch doesn't exist anymore :D
[18:54:16] (PS1) Ladsgroup: Bump ORES to HEAD [services/ores/deploy] - https://gerrit.wikimedia.org/r/461451 (https://phabricator.wikimedia.org/T204862)
[18:55:08] (CR) Ladsgroup: [V: 2 C: 2] Bump ORES to HEAD [services/ores/deploy] - https://gerrit.wikimedia.org/r/461451 (https://phabricator.wikimedia.org/T204862) (owner: Ladsgroup)
[19:02:14] it's live on beta
[19:02:16] hammer time!
[19:03:31] and it works
[19:24:13] oh heck yeah
[19:26:24] the average response time when there are 6 simultaneous threads connected is 3.4 seconds, and when there are two connected it's 2.4, meaning the soft limit also works
[19:34:25] I'm not sure what that's all about. What's the soft limit?
[19:34:52] You're only allowing 3 workers per IP or something?
[19:38:22] awight: the soft limit is 4 and the hard one is 7. If someone requests scores with more than four connections, we don't start the process of responding until any of the first four is released
[19:38:35] that actually increases the lock time drastically
[19:38:43] but when there is the need
[19:39:23] basically after four locks, the fifth one doesn't get a response until one of the four gets released
[19:39:31] ah very nice, is that just a courtesy to the caller, though?
[19:39:35] making the whole number of workers/ip static
[19:39:42] yup
[19:40:01] and after seven, it starts to respond with a timeout
[19:40:04] Seems like we should save the sockets by dropping hard, and the client has a better chance of discovering the throttling
[19:41:03] we have the policy of no more than four connections per IP
[19:41:21] we are just enforcing it (with even some compromise) now
[19:41:47] I guess we're not going to hit the socket limit, but we're using up a web worker and leaving it idle, right?
[19:42:27] yup
[19:43:01] DOWNTIMEEND - Host ORES-worker01.experimental is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms awight neutron network, firewall?
[19:43:45] kk it's probably fine for our use case of a single researcher with OCD
[19:44:01] DOWNTIMEEND - Host ORES-web01.Experimental is UP: PING OK - Packet loss = 0%, RTA = 1.74 ms awight neutron network, firewall?
[19:44:01] DOWNTIMEEND - Host ORES-worker02.experimental is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms awight neutron network, firewall?
[19:44:01] DOWNTIMEEND - Host ORES-web02.Experimental is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms awight neutron network, firewall?
[19:44:01] DOWNTIMEEND - ssh on ORES-redis02.experimental is CRITICAL: CRITICAL - Socket timeout after 10 seconds awight neutron network, firewall?
[19:44:08] That's basically the reason we are building it :P
[19:44:11] I'm going to write an announcement soon
[19:44:15] It would make a DDoS about 2x easier, but that's not in our threat model
[19:44:18] right
[19:44:33] wanted to make sure this works first
[19:44:39] I do feel odd about being "polite", though. Is this a best practice from somewhere?
[19:44:58] it doesn't make the DDoS easier, we are always prone to DDoS :D
[19:45:12] that's what alex told me
[19:45:19] exactly--well, it means it can be done with half the number of zombie nodes
[19:45:57] Cool. Congrats on getting this working so well!
[19:46:09] "that's what alex told me" <- that's about the polite thing
[19:46:33] "well, it means it can be done with half the number of zombie nodes" web nodes were never our bottleneck
[19:46:38] workers were
[19:47:07] we can spin up a large number of web nodes without any trouble
[19:50:12] But the number of web nodes is tuned to only a small margin above the number we need to serve the theoretical max
[19:50:28] It's calculated to keep the celery queues full.
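
Note on the limits described at 19:38-19:40: the behaviour (up to 4 connections per IP are served, the next ones wait for a slot, and anything beyond 7 is turned away) can be sketched roughly as below. This is an in-process illustration only, not the ORES or PoolCounter code; the class name `PerIPLimiter` and its methods are hypothetical, and the real setup uses a separate PoolCounter service.

```python
import threading
from collections import defaultdict


class PerIPLimiter:
    """Hypothetical in-process stand-in for the per-IP soft/hard limit."""

    def __init__(self, soft=4, hard=7):
        self.soft = soft  # connections per IP served at the same time
        self.hard = hard  # connections per IP held at all (served + waiting)
        self._cond = threading.Condition()
        self._running = defaultdict(int)
        self._waiting = defaultdict(int)

    def acquire(self, ip, timeout=None):
        """Return True once a slot is held, False if rejected or timed out."""
        with self._cond:
            # Hard limit: too many connections already held, refuse at once.
            if self._running[ip] + self._waiting[ip] >= self.hard:
                return False
            self._waiting[ip] += 1
            try:
                # Soft limit: wait until one of the running slots is released.
                got_slot = self._cond.wait_for(
                    lambda: self._running[ip] < self.soft, timeout
                )
            finally:
                self._waiting[ip] -= 1
            if got_slot:
                self._running[ip] += 1
            return got_slot

    def release(self, ip):
        """Give back a slot and wake up any waiting callers."""
        with self._cond:
            self._running[ip] -= 1
            self._cond.notify_all()
```

A caller would wrap each request in `acquire(ip)` / `release(ip)`; a False return corresponds to the timeout response mentioned above. Each waiting caller still occupies a web worker, which is the trade-off raised at 19:41:47.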
[19:51:01] now I think we can increase it
[19:51:09] I'll ask alex tomorrow
[19:51:18] the thing is web nodes are not CPU intensive
[19:52:06] that's true, the only reason we don't have an unlimited number is RAM
[19:55:03] with COW it should not increase much (hopefully). We'll check with Alex
[20:08:50] Scoring-platform-team (Current), ORES, WMF-JobQueue, Regression, and 2 others: Failed executing job: ORESFetchScoreJob - https://phabricator.wikimedia.org/T204753 (Krinkle) Should the following error have its own task, or should it be tracked here as well? ```name=message [{exception_id}] {...
[20:09:05] PROBLEM - ssh on ORES-redis02.experimental is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:09:19] Currently the number of workers should be pretty close to correct. I don't think this patch changes anything about that math, now that I understand better.
[20:19:48] disappearing for 1-2hr...
[20:20:46] (PS6) Awight: Change schema to a list of heterogenous judgments [extensions/JADE] - https://gerrit.wikimedia.org/r/456424
[20:20:48] (PS2) Awight: [WIP] update schema to include endorsements; drop page judgments for now [extensions/JADE] - https://gerrit.wikimedia.org/r/461255
[20:20:50] (PS8) Awight: [WIP] Secondary indexes for JADE pages [extensions/JADE] - https://gerrit.wikimedia.org/r/456078 (https://phabricator.wikimedia.org/T203037)
[20:22:32] (CR) jerkins-bot: [V: -1] [WIP] Secondary indexes for JADE pages [extensions/JADE] - https://gerrit.wikimedia.org/r/456078 (https://phabricator.wikimedia.org/T203037) (owner: Awight)
[20:24:59] (CR) jerkins-bot: [V: -1] [WIP] update schema to include endorsements; drop page judgments for now [extensions/JADE] - https://gerrit.wikimedia.org/r/461255 (owner: Awight)
[20:26:20] (CR) jerkins-bot: [V: -1] [WIP] update schema to include endorsements; drop page judgments for now [extensions/JADE] - https://gerrit.wikimedia.org/r/461255 (owner: Awight)
[20:27:33] (CR) jerkins-bot: [V: -1] Change schema to a list of heterogenous judgments [extensions/JADE] - https://gerrit.wikimedia.org/r/456424 (owner: Awight)
[20:28:53] (CR) jerkins-bot: [V: -1] [WIP] Secondary indexes for JADE pages [extensions/JADE] - https://gerrit.wikimedia.org/r/456078 (https://phabricator.wikimedia.org/T203037) (owner: Awight)
[20:37:33] Scoring-platform-team (Current), ORES, User-Ladsgroup: Test poolcounter support for ores in beta cluster - https://phabricator.wikimedia.org/T201825 (Ladsgroup) a:Ladsgroup
[20:37:41] Amir1, hey
[20:37:49] Hey
[20:37:59] just saw your wikitech-l email
[20:38:23] are you familiar with labs public networking?
[20:38:36] not much
[20:38:39] ok
[20:38:48] is there any doc?
[20:38:56] possibly but I can just explain what I know
[20:39:02] basically there are instances with floating IPs and instances without floating IPs
[20:39:46] if you make a connection out from an instance with a floating IP, that's what will be used
[20:40:29] but if you make a connection out from an instance without a floating IP, you get what is currently 208.80.155.255, which I named internal-server-nat.wmflabs.org
[20:41:13] and so this IP can get shared across multiple instances from all projects
[20:41:41] interesting
[20:41:51] what about toolforge?
[20:42:09] now I think toolforge assigns floating IPs to all its exec nodes
[20:42:47] so they should be fine
[20:43:22] assuming traffic from toolforge doesn't get excessive on a particular exec node
[20:43:32] I'd keep an eye on that 208.80.155.255 IP for the time being
[20:43:32] cool, so whitelisting only 208.80.155.255 would work
[20:43:48] well I'm not saying you have to whitelist but I'd just recommend keeping an eye
[20:44:52] sure
[20:45:02] Thank you for the information
[20:47:49] that's enough for today
[20:47:51] np
[20:48:06] "that's enough for today" <- I meant work
[20:48:13] I figured :)
[20:48:28] it's already 23 here
[20:48:35] see you tomorrow!
[20:48:45] Thank you again Krenair
[20:48:56] you're welcome
[20:49:03] yeah it's very late to be working :S
[21:05:53] Oh, it's also possible that some stuff in prod gets to see labs private addressing (instead of public addressing), but not all. There's a ticket floating around somewhere for that.
[21:26:16] (back again)
[21:33:38] harej: Proposal which will require halfak... I'd like to change the term "schema" to "scale"
[21:34:04] There might be good reasons not to, but "schema" just seems confusing.
[21:39:10] PROBLEM - ssh on ORES-redis02.experimental is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:42:53] Amir1: hi! Just saw the parallel connections limiting...
[21:44:04] the Dashboard (both the Wiki Ed instance on linode, and the global instance on CloudVPS) has been configured to make up to 50 concurrent requests.
[21:45:19] I guess that probably rarely actually goes above 4 open requests, since the 50 threads don't necessarily all have open connections at the same time...
[21:46:07] but I'll go ahead and change that to 4.
[21:46:08] ragesoss: hey, can you get some stats about it? I can whitelist the IPs
[21:46:36] I can get the IPs. anything else that you need?
[21:46:44] O/ I'm back from the dead
[21:46:54] Nothing so far
[21:48:43] ragesoss: I will implement it tomorrow. 
If you file a task so I don't forget, that would be amazing
[21:49:06] okay, will do.
[21:49:25] Amir1: if I remember I'll remind you as well :)
[21:49:44] There's this already. Should I just add the IPs there? https://phabricator.wikimedia.org/T201826
[21:50:24] ragesoss: i think it’d be better to do a new task imho but it's up to amir tbh
[21:52:07] ragesoss: Hi! Just wondering if the requests are made through a job queue, or if they're made directly. Asking because limiting the client concurrency is IMO a good idea either way.
[21:53:35] ... but don't want to cause you hardship and lots of extra coding.
[21:56:31] awight: they are done in a background process, although not through the job queue per se. But the actual concurrent requests happen via threads: the app batches the revs we want data for in sets of X, where X is the value set for OresApi concurrency. That's currently 50 (which is what I put it to on the advice of halfak, about 2 years ago), but I can easily change it.
[21:56:47] ah
[21:57:01] Batch size is something different--you're making a single request for all 50, right?
[21:57:58] awight: no. is it possible to get data for 50 revs in one request?
[21:58:06] I thought you could only get 1 rev per request.
[21:58:12] https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging%7Cgoodfaith%7Cwp10&revids=860164911|860164910|860164909&precache=1&format=json
[21:58:16] \o/
[21:58:22] yes, enjoy <3
[21:58:31] oh, nice.
[21:59:13] 4 threads of 50 revs at a time is no problem
[21:59:19] okay. that will take a little bit of work to implement, but that'll be great.
[21:59:40] when I implemented it two years ago, there were definitely no multi-revision requests.
[22:00:17] I can believe that, there have been a lot of improvements to the server and the API format.
[22:00:21] FYI: https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&orgId=1
[22:00:51] That was there since the beginning
[22:01:07] at least, not with the ?features option maybe?
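
Note on the batching discussed at 21:57-21:59: a client that sends several revision IDs per request, and keeps to 4 threads of 50 revs, might look roughly like the sketch below. The URL shape follows the example pasted at 21:58:12; the helper names (`batch`, `build_score_url`, `fetch_all`) are hypothetical, error handling and retries are left out, and the `precache` and `format` parameters from the pasted URL are omitted here as an assumption that they are optional.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def batch(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_score_url(wiki, models, rev_ids):
    """Build one scores URL covering several revisions at once."""
    query = urlencode({
        "models": "|".join(models),  # "|" is percent-encoded as %7C
        "revids": "|".join(str(rev_id) for rev_id in rev_ids),
    })
    return f"{ORES_BASE}/{wiki}/?{query}"


def fetch_all(wiki, models, rev_ids, batch_size=50, max_threads=4):
    """Fetch scores in batches, never using more than `max_threads` connections."""
    urls = [build_score_url(wiki, models, chunk)
            for chunk in batch(rev_ids, batch_size)]

    def fetch(url):
        with urlopen(url, timeout=30) as response:
            return json.load(response)

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(fetch, urls))
```

Capping the pool at 4 threads keeps the client within the four-connections-per-IP policy, so it should not need whitelisting for ordinary use.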
[22:01:09] We're handling 2-4K requests per minute routinely, to give you a sense of how your requests will impact the servers.
[22:01:28] ?features is expensive for sure, you might even hit the response size limit.
[22:01:34] I think that's around 1.5MB
[22:01:43] but I see that you can get features for multiple revs at a time now.
[22:02:17] Feature injection isn't possible per revision, if that's something you need.
[22:02:30] yeah, don't need that per revision.
[22:02:34] just need to fetch the features.
[22:02:40] * awight mops virtual sweat from brow
[22:02:57] * awight massages ORES cluster's shoulders
[22:04:53] awight: only 2-4k requests... pssh amateur hour xD jk
[22:08:17] Zppix: It's hard to get the gerbils to work any harder than that without shedding fur everywhere.
[22:12:30] Don't we hire people to clean the fur up (on-site people)
[22:12:52] * Hauskatze is furry
[22:13:36] Hauskatze: that can be interpreted many different ways
[22:13:45] as a cat, I have fur
[22:13:48] :P
[22:14:23] * Hauskatze is happy lots of patches made by him are getting merged today \o/
[22:14:30] * awight is glad to have Hauskatze confirm suspicions that Hauskatze is Felinis vulgaris.
[22:14:43] Felis S. Catus please
[22:14:57] no stroopwafels for you awight
[22:14:59] :P
[22:15:04] awight: so he is a vulgar cat? jk
[22:16:25] * awight perks up at the thought of rubbing caramel all over gerbil-furred carpet
[22:17:10] Lol
[22:17:34] https://meta.wikimedia.org/wiki/Association_of_Stroopwafel_Addicts
[22:18:55] Very nice. I'm still waiting to have a cheese stroopwafel, I think that will be my jam.
[22:21:50] I like cheese
[22:33:34] Scoring-platform-team, ORES, User-Ladsgroup: Add Wiki Education Dashboard and Programs & Events Dashboard to ORES connection whitelist - https://phabricator.wikimedia.org/T204897 (Ragesoss)
[22:33:58] Scoring-platform-team (Current), ORES, Patch-For-Review, User-Ladsgroup: Use poolcounter to limit number of connections to ores uwsgi - https://phabricator.wikimedia.org/T160692 (Ragesoss)
[22:34:00] Scoring-platform-team, ORES, User-Ladsgroup: Add Wiki Education Dashboard and Programs & Events Dashboard to ORES connection whitelist - https://phabricator.wikimedia.org/T204897 (Ragesoss)
[22:37:26] (Abandoned) Awight: Change schema to a list of heterogenous judgments [extensions/JADE] - https://gerrit.wikimedia.org/r/456424 (owner: Awight)
[22:37:33] (PS3) Awight: [WIP] Update schema to current consensus [extensions/JADE] - https://gerrit.wikimedia.org/r/461255
[22:42:46] (CR) jerkins-bot: [V: -1] [WIP] Update schema to current consensus [extensions/JADE] - https://gerrit.wikimedia.org/r/461255 (owner: Awight)
[22:43:58] (CR) jerkins-bot: [V: -1] [WIP] Update schema to current consensus [extensions/JADE] - https://gerrit.wikimedia.org/r/461255 (owner: Awight)
[23:09:14] PROBLEM - ssh on ORES-redis02.experimental is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:21:37] awight: "scale" does seem more user-friendly.
[23:21:46] But, does it really matter what we call it?
[23:22:22] harej: Editors will be exposed to our jargon, so I think it does matter in the long run.
[23:22:40] Good point
[23:23:01] Also, we should have the glossary pretty well locked in even before the pilot deployment, to avoid shifting the ground under users' feet.
[23:23:21] There's plenty of time yet, I just wanted to flag this term.
[23:29:39] TODO: glossary :-)
[23:33:19] (PS4) Awight: Update judgment content schema [extensions/JADE] - https://gerrit.wikimedia.org/r/461255
[23:34:14] ^ that gets us up to speed with the content schema.
[23:34:32] harej: FYI, https://gerrit.wikimedia.org/g/mediawiki/extensions/JADE/+/refs/changes/55/461255/4/jsonschema/judgment/v1.json
[23:38:16] (PS5) Awight: Update judgment content schema [extensions/JADE] - https://gerrit.wikimedia.org/r/461255
[23:38:41] Data that validates against that schema, https://gerrit.wikimedia.org/g/mediawiki/extensions/JADE/+/refs/changes/55/461255/5/tests/data/valid_diff_judgment.json
[23:51:17] Scoring-platform-team, JADE, Documentation: Write glossary of JADE concepts - https://phabricator.wikimedia.org/T204905 (awight)