[14:14:57] halfak: o/
[14:26:21] o/
[14:42:28] o/
[14:44:49] halfak: did you get the ping for https://phabricator.wikimedia.org/T164994?
[14:45:11] I don't know what to answer there :/
[15:45:35] * halfak does the meetings
[16:26:49] halfak: I gotta go now. I hope you can comment on the mentioned ticket as it seems blocking the GetSuggestions patch. Thanks :D
[16:54:29] glorian_wd, just responded.
[18:01:49] halfak: im back again
[18:03:04] o/ Zppix
[18:03:24] Did you find a way to review our 500s check and why we didn't get an icinga about ORES issues?
[18:03:34] I imagine the config is public I just don't know where it lives :/
[18:04:04] FYI: last icinga page i have for production is April 26th
[18:04:37] I havent had a chance ill talk to ops, could you maybe give them a heads up ill be talking to them so they know why im asking and that you know about it :P
[18:33:08] halfak does icinga query against ores ie hit a domain
[18:33:25] halfak|Lunch | Zppix ^^
[18:33:26] ?
[18:34:03] paladox: we talked to them in operations, it looks to me theres no changeprop coverage where the issue lies
[18:34:07] or does it run a check on the host ie for example check_service or something similar? We could just set it up on our labs instance to query a domain if that is what you want.
[18:34:17] Zppix what if we hit the domain
[18:34:21] ie use check_http?
[18:34:28] paladox: I dont know how they have changeprop setup
[18:34:48] Zppix we doint neccisary have to have changeprop. We could use check_http
[18:34:50] ill ask aaron when he returns
[18:35:02] to make sure it dosen't return any errors when checking if the website is up.
[18:35:22] its not the website that reported the 500 iirc its the changeprop
[18:35:28] Ok
[18:35:37] Zppix so the website was unaffected?
[18:36:23] paladox https://wikitech.wikimedia.org/wiki/Incident_documentation/20170613-ORES
[18:37:16] ok thanks
[18:40:20] Zppix what's check ores worker for?
[18:40:21] https://github.com/wikimedia/puppet/blob/401973ab7f79fd4567749fe074ccce1d47446581/modules/nagios_common/files/check_commands/check_ores_workers
[18:40:53] I have no clue, im still looking into the situation myself
[18:41:07] ok
[18:54:51] Amir1: im around if you need me to watch anything while you do that deploy
[18:55:21] sure, thanks
[18:55:30] ok
[18:55:32] np
[19:25:03] halfak|Lunch: https://ores.wikimedia.org/example
[19:28:49] Amir1, cool!
[19:29:11] Amir1, did you do a deploy today?
[19:29:18] * halfak reads backscroll
[19:29:25] halfak: yeah
[19:29:38] I was trying to see a 404 and I was getting 500 instead
[19:29:56] I realized the config variables are missing and causing "KeyError"
[19:30:04] so I added them fast and deployed
[19:31:11] https://phabricator.wikimedia.org/T113114#3352811
[19:31:14] OK good deal :)
[19:31:36] https://gerrit.wikimedia.org/r/#/c/359224/
[19:31:43] This gives me an idea for deployment checks we should have.
[19:31:50] maybe even a ci check.
[19:32:59] Yeah, I think we should check for configs in test setup (https://github.com/wiki-ai/ores/blob/master/config/ores-testwiki.yaml against prod config
[19:34:01] I've been meaning to rename that file anyway. I'm going to rename that and think about prod-like configs strategy
[19:34:15] cool, keep me in loop
[19:34:22] I'm calling it a day
[19:34:27] will do. have a good one dude. :)
[19:34:34] you too
[19:34:46] See you tomorrow, it will be mostly ores tomorrow
[19:34:55] \o/ :)
[19:35:01] o/
[19:40:00] https://phabricator.wikimedia.org/T168007
[19:40:06] [14:40:00] https://phabricator.wikimedia.org/T16
[19:40:08] Woops
[19:40:13] https://phabricator.wikimedia.org/T168007
[19:51:07] halfak: did you see my reply
[19:51:10] on ticket
[19:51:46] No replies on that ticket
[19:51:57] Do you mean one we looked at earlier today?
[19:52:06] https://phabricator.wikimedia.org/T167830#3352688
[19:52:25] Ahh yes. I can't answer that.
[19:53:09] halfak: if we do do grafana only people with prod access i believe will be able to modify the checks... meaning only you and amir could
[19:53:34] i think thats the only way though
[19:53:42] considering its prod instances
[19:54:49] Zppix, I don't have +2 on puppet repo
[19:54:56] You submit patches same as I do.
[19:55:14] iirc grafana is edited via webui
[19:56:41] I think what needs to happen is we communicate with services, see what they think we should do, go over the options, then talk to operations about implementing them
[19:56:59] Oh! Grafana.
[19:57:01] Sorry
[19:58:14] halfak i wonder would the 500 errors be exposed over http?
[19:58:23] If so we could implement it in icinga2 :).
[19:58:25] paladox, huh?
[19:58:31] 500 is an http error code
[19:58:43] halfak the errors you were talking about earlier in -operations.
[19:58:43] Oh I think I see what you mean
[19:58:54] halfak: what domain would the 500 come from?
[19:58:59] No way we're relying on a labs service to monitor a prod service :/
[19:59:39] ok
[19:59:52] I dont think we can technically anyway unless it was HTTP
[20:00:49] halfak: is this only occuring on scb1001?
[20:01:25] It was. It's not anymore.
[20:01:38] so all scb hosts?
[20:03:08] cause according to the incident docs it was just scb1001, if its all of them we need to look at the graphs for all scb hosts for when the error occoured
[20:04:23] let me take a quick look into the graphs again, let me see if any similar thing has happened at all in the past with other services running on scb, and I'll look at what they did. halfak
[20:05:08] It was scb1001 but it is not anymore
[20:05:44] halfak we doint have to rely on labs for the monotoring :), just as a backup. :)
[20:07:37] halfak: i see what you mean now, im still going to do some more looking, but honestly, i think we need to ask services what they think (if possible), then talk about what we want to do, then implment.
[20:08:23] Zppix, services has this on their backlog too.
[20:08:35] pchelolo is a services engineer
[20:10:32] oh
[20:11:00] halfak: okay let me look at the graphs as of now and compare them to when the event occoured
[20:14:50] 20 vs 3 okay, why didnt icinga pick that up?
[20:15:35] halfak: Who did the deploy that caused that, amir or you ?
[20:18:50] Zppix halfak https://gerrit.wikimedia.org/r/#/c/359243/ :)
[20:18:53] There was not a deploy that caused it
[20:19:00] It was another service
[20:19:04] We noticed it during a deploy
[20:19:11] Because icinga didn't go off
[20:19:22] https://wikitech.wikimedia.org/wiki/Incident_documentation/20170613-ORES
[20:19:36] halfak: oh yes i see that now... Who maintains the pdf service that caused the high cpu do you know?
[20:19:52] Services
[20:19:56] Doesn't matter though
[20:20:09] Because the only real problem here is that our icinga check failed.
[20:20:30] See the notes in the incident report and the followup actions.
[20:23:06] Looks like we get pretty high accuracy with the trwiki article quality model :)
[20:24:04] lol
[20:24:20] halfak: i got an answer the checkrate i put in the task T167830
[20:24:20] T167830: Extend icinga check to catch 500 errors like those of the 20170613 incident - https://phabricator.wikimedia.org/T167830
[20:28:47] what ever happened to wmbot
[20:30:00] what do you mean?
[20:30:12] wm-bot4: is here
[20:31:01] Im liking the grafana route, if you agree halfak i can look into having that setup
[20:31:22] wm-bot isn't posting about relevant phab stuff.
[20:31:38] Someone else set that up in the past. wm-bot has been silent in this channel for a while.
[20:32:46] wm-bot4: doesnt do that its wikibugs
[20:32:58] oh! that then.
[20:33:05] i have set up wikibugs a few days if not a week ago i dont know why its not working]
[20:33:12] ill talk to someone about that
[20:33:23] Zppix wikibugs is maintained by legoktm
[20:34:44] paladox: i asked another maintainer that merged the change i made
[20:34:48] ok
[20:40:05] Zppix im going to migrate some hosts from director to public repo :)
[20:40:12] Makes easier at editing them
[20:40:23] not for me :(
[20:40:38] but whatever
[20:42:19] Zppix oh why?
[20:42:26] You have +2 rights in the repo
[20:42:42] I know but i like UIs but whatever i need to learn to do it anyway
[20:42:56] Oh
[20:43:06] Zppix yeh, editing file by hand will be easier
[20:43:18] as director does not support everything icinga2 does :)
[20:43:55] just keep director incase i dont have access to gerrit or something
[20:51:59] ok
[20:52:14] Zppix you can create changes through the ui :)
[20:53:33] I know.. but in the off chance i cannot access gerrit i would like a fall back:P
[20:53:42] halfak: i sent a msg to ai
[20:54:06] cool :)
[20:54:48] ok
[20:54:50] halfak: i also do think grafana is the way to go, what do you think?
[20:54:56] looks like a gerrit problem has popped
[20:54:57] up
[20:55:05] https://phabricator.wikimedia.org/T168012#3353004
[20:55:40] Zppix, I think so yeah.
[20:55:53] Then again, I think if any request to ORES fails with a 500 I want a notification
[20:56:05] Does changeprop have a nice grafana for that?
[20:56:09] *graphite metric
[21:02:10] i cant see anything on grafana ui for changeprop
[21:04:43] Scoring-platform-team, articlequality-modeling, artificial-intelligence: Implement wp10 model for trwiki - https://phabricator.wikimedia.org/T164671#3353021 (Halfak) ``` cat datasets/trwiki.labeling_revisions.w_cache.2k.json | \ revscoring cv_train \ revscoring.scorer_models.Gradien...
[21:05:02] yay! Wikibugs!
[21:05:06] Welcome back botbro
[21:05:06] halfak: your welcome :)
[21:06:39] graphite is probably the best way to find a metric.
[21:06:44] But you can't see that
[21:06:45] im looking
[21:06:50] via asking
[21:12:24] Zppix, so. Sorry for the shitty thoughts from other folks.
[21:12:45] ive signed mutiple things i've signed an thing (i think it was a form of nda) for orts and another one confinitaly and server access responiblities
[21:12:53] I get why they are skeptical of volunteers who bounce around different projects but I don't get why they are dismissive. :\
[21:12:59] halfak: meh thats nothing
[21:13:22] Anyway, if we can use your NDA to get you access to graphite, let's do that.
[21:13:37] i have no clue how to start that :P but ok
[21:14:08] https://phabricator.wikimedia.org/T56713 Interesting
[21:15:01] https://phabricator.wikimedia.org/T91513 is the kind of task you need.
[21:15:09] Make one that looks like that.
[21:15:26] What's your labs username?
[21:17:24] Zppix
[21:18:15] halfak: https://phabricator.wikimedia.org/T168014#3353061
[21:19:29] lol Zppix you pinged your self
[21:19:39] Scoring-platform-team, Operations, Ops-Access-Requests, User-Zppix: Graphite access for Zppix - https://phabricator.wikimedia.org/T168014#3353092 (Halfak)
[21:19:52] no paladox halfak asked lol
[21:20:06] Zppix heh?
[21:20:11] i see you ping your self
[21:20:14] [22:17:25] Zppix
[21:20:51] look above the github notif
[21:22:01] ah
[21:22:03] lol
[21:26:36] halfak: how do you train/make the models like ythe one you edited on github today
[21:26:41] Scoring-platform-team, Operations, Ops-Access-Requests, User-Zppix: Graphite access for Zppix - https://phabricator.wikimedia.org/T168014#3353061 (jcrespo) BTW, graphite access is already public: https://graphite.wikimedia.org/render/?target=ores.scb1001.precache_cache_hit.count https://graphite...
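Editor's sketch: the render URL jcrespo posted can be generalized into a small query helper. This is a minimal sketch, not what anyone in the channel actually ran. It assumes the endpoint accepts Graphite's standard `from` and `format=json` render parameters, and the sample payload is made-up data in Graphite's documented JSON shape. Only the host and the metric name come from the log; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Host taken from the ticket comment; reachability without login is an assumption.
GRAPHITE_RENDER = "https://graphite.wikimedia.org/render/"


def render_url(target, since="-1h"):
    """Build a Graphite render URL that asks for JSON output."""
    query = urlencode({"target": target, "from": since, "format": "json"})
    return f"{GRAPHITE_RENDER}?{query}"


def latest_value(payload):
    """Return the most recent non-null value of the first series, or None."""
    series = json.loads(payload)
    if not series:
        return None
    # Graphite returns datapoints as [value, timestamp] pairs, oldest first.
    for value, _timestamp in reversed(series[0]["datapoints"]):
        if value is not None:
            return value
    return None


url = render_url("ores.scb1001.precache_cache_hit.count")

# Made-up response body, used only to exercise the parser offline.
sample = (
    '[{"target": "ores.scb1001.precache_cache_hit.count", '
    '"datapoints": [[3.0, 1497475500], [null, 1497475560]]}]'
)
value = latest_value(sample)
```

Fetching `url` with any HTTP client and passing the body to `latest_value` would give one number to compare against a threshold, which is the kind of input an alert on 500s would need once a suitable metric is identified.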
[21:28:00] Scoring-platform-team, Operations, Ops-Access-Requests, User-Zppix: Graphite access for Zppix - https://phabricator.wikimedia.org/T168014#3353137 (Zppix) @jcrespo when going to graphite.wikimedia.org i get a prompt to login i try every wmf login i have (non wikitech and wikitech and shell ) and n...
[21:46:07] Scoring-platform-team, articlequality-modeling, artificial-intelligence: Implement wp10 model for trwiki - https://phabricator.wikimedia.org/T164671#3353213 (Halfak) https://github.com/wiki-ai/wikiclass/pull/40
[21:47:07] Zppix, you can check out that pull request
[21:47:13] Note the lines I added to the Makefile
[21:52:08] Zppix lol changes created throw webui are getting corrupted now
[21:52:11] throwing errors
[21:54:49] halfak: still there?
[22:00:39] yes
[22:02:31] glorian_wd, ^
[22:03:59] halfak: I am thinking how to move forward with the patch
[22:04:05] as Daniel seems against the approach
[22:04:12] see: https://gerrit.wikimedia.org/r/#/c/356043/10
[22:06:14] His last comment was addressed in the ticket by me
[22:08:06] halfak: ok. Probably I should wait a lil bit before I am sure what I should do. Otherwise, I will keep get -1 from him because he still don't agree with the approach. I will help answering his queries though
[22:13:48] scb1001 just had pdfrender enabled so watch ores halfak
[22:32:42] glorian_wd, looks like there are a lot of comments that have nothing to do with his core question.
[22:32:51] Maybe you could address those in the meantime.
[22:33:16] halfak: do you mean comments in Geritt?
[22:33:24] such like the aggregate function?
[22:34:12] right
[22:35:25] halfak: yeah, there is still disagreement between using MAX and AVG
[22:35:50] I don't think using MAX and AVG will give so different result though.
[22:37:44] Scoring-platform-team, DBA, articlequality-modeling, artificial-intelligence: [Discuss] Hosting the monthly article quality dataset on labsDB - https://phabricator.wikimedia.org/T146718#2669300 (Halfak) Somehow I accidentally added yuvi. :S So I've attempted to load the table once and I had...
[22:37:57] glorian_wd, I'm not sure. Maybe you can catch Daniel when he's awake and active to ask for clarity.
[22:38:04] * paladox deployed https://gerrit.wikimedia.org/r/#/c/359357/
[22:38:54] halfak: meanwhile, this evening, I think I am close to dumping Wikidata datasource and feature into wikidatawiki.py. Thus, I guess I am close to submitting a PR \o/
[22:39:17] great :)
[22:39:33] Alright. I will try to do it this afternoon when I have woken up today.
[22:39:40] * glorian_wd it's 00:39 AM here
[22:41:08] Staging looks good.
[22:41:38] CUSTOM - Host ores-05 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms https://gerrit.wikimedia.org/r/#/c/359357/ worked :)
[22:52:32] Deploy is good
[22:53:05] Scoring-platform-team, Wikilabels, User-Zppix: Early June Wiki labels deploy - https://phabricator.wikimedia.org/T167061#3316277 (Halfak) Updated wikilabels with bugfix and tested by getting the target error message and then labeling some edits. Declaring victory.
[23:02:34] (CR) Catrope: "The failures are due to:" [extensions/ORES] - https://gerrit.wikimedia.org/r/358311 (https://phabricator.wikimedia.org/T155930) (owner: Ladsgroup)
[23:13:01] Scoring-platform-team, ORES, revscoring, artificial-intelligence: Language assets for Tamil - https://phabricator.wikimedia.org/T166052#3353462 (Halfak) Ready for review. https://github.com/wiki-ai/revscoring/pull/298
[23:13:18] OK good enough for today. Have a good one folks!
[23:13:18] o/
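
Editor's sketch: the deployment check floated around 19:31 to 19:33 (compare the test config against the prod config so a missing variable shows up in CI rather than as a "KeyError" behind a 500) could look roughly like this. It is a minimal sketch under stated assumptions, not the check the team built. It assumes both configs are already parsed into nested dicts (for example from YAML), and every key name in the sample data is hypothetical.

```python
def flatten_keys(config, prefix=""):
    """Return the set of dotted key paths found in a nested dict."""
    paths = set()
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        paths.add(path)
        if isinstance(value, dict):
            paths |= flatten_keys(value, path)
    return paths


def missing_keys(reference, candidate):
    """Key paths present in `reference` but absent from `candidate`, sorted."""
    return sorted(flatten_keys(reference) - flatten_keys(candidate))


# Hypothetical configs standing in for the test and production files.
test_config = {
    "ores": {
        "wsgi": {"hypothetical_host": "a", "hypothetical_new_option": "b"},
        "score_caches": {},
    }
}
prod_config = {
    "ores": {
        "wsgi": {"hypothetical_host": "a"},
        "score_caches": {},
    }
}

# Keys the code is tested with that production would not define.
missing = missing_keys(test_config, prod_config)
```

A CI job could fail whenever `missing` is non-empty. Only key paths are compared, not values, since production values are expected to differ from test values.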