[07:20:46] 10Scoring-platform-team, 10Wikilabels: Wikilabels uses deprecated api calls - https://phabricator.wikimedia.org/T170758#3447160 (10AnotherLadsgroup) TLDR is that this call should be something like this: https://nl.wikipedia.org/w/api.php?callback=jQuery213048846344104261585_1500159918758&action=compare&fromrev... [08:12:10] 10Scoring-platform-team, 10Wikidata, 10editquality-modeling, 10User-Ladsgroup, and 2 others: Add basic bad word check to Wikidata feature set - https://phabricator.wikimedia.org/T170834#3447212 (10AnotherLadsgroup) [08:12:58] 10Scoring-platform-team, 10Wikidata, 10editquality-modeling, 10User-Ladsgroup, and 2 others: Add entropy-related and uppercase-related measures to comments - https://phabricator.wikimedia.org/T170835#3447215 (10AnotherLadsgroup) [09:28:32] 10Scoring-platform-team, 10MediaWiki-JobQueue, 10ORES, 10Performance-Team, and 5 others: Job queue corruption after codfw switch over (Queue growth, duplicate runs) - https://phabricator.wikimedia.org/T163337#3447299 (10elukey) 05Open>03Resolved a:03elukey [10:50:39] PROBLEM - https://grafana.wikimedia.org/dashboard/db/ores grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/ores is alerting: 5xx rate (Change prop) alert. [11:03:39] RECOVERY - https://grafana.wikimedia.org/dashboard/db/ores grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/ores is not alerting. [11:11:29] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 980 bytes in 0.057 second response time [11:37:39] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 941 bytes in 0.519 second response time [11:41:49] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:42:49] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 938 bytes in 8.031 second response time [12:00:59] PROBLEM - https://grafana.wikimedia.org/dashboard/db/ores grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/ores is alerting: 5xx rate (Change prop) alert. [12:13:09] RECOVERY - https://grafana.wikimedia.org/dashboard/db/ores grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/ores is not alerting. [14:01:29] 10Scoring-platform-team, 10editquality-modeling, 10User-Ladsgroup, 10artificial-intelligence: Flagged revs approve model to fiwiki - https://phabricator.wikimedia.org/T166235#3448058 (10Zache) > I'm unsure whether the reverted query is getting what we intended. When does "rv" get added to the ChangeTags, a... [14:02:35] o/ [14:02:47] I got some messages from Amir1 talking about the outage. [14:02:52] Anyone have more information? [14:02:53] akosiaris, ? [14:03:48] halfak: yeah we are not sure what has happened yet [14:04:02] we failed over the ores database for kernel upgrades [14:04:16] The cache? [14:04:18] and while we were waiting for the old one to drain uwsgi started spewing 503s [14:04:19] both [14:04:28] cache and queue are the same box [14:04:35] we reverted and are ok now [14:04:46] but it's unclear yet what happened [14:05:03] uwsgi logs are not helpful [14:05:14] OK so just to be clear, there was a kernel upgrade and a scheduled reboot, right? 
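A minimal sketch of the `action=compare` request that the T170758 comment above points at; the revision IDs are hypothetical and the JSONP `callback` parameter from the pasted URL is dropped for simplicity.

```python
# Minimal sketch of the action=compare call referenced in T170758.
# The revision IDs are hypothetical; fromrev/torev are the documented
# parameters of the compare module.
import requests

API_URL = "https://nl.wikipedia.org/w/api.php"

def compare_revisions(from_rev, to_rev):
    """Fetch the diff HTML between two revisions via action=compare."""
    params = {
        "action": "compare",
        "fromrev": from_rev,
        "torev": to_rev,
        "format": "json",
    }
    response = requests.get(API_URL, params=params, timeout=10)
    response.raise_for_status()
    data = response.json()
    # The diff table HTML lives under compare["*"] in the default format.
    return data["compare"]["*"]

if __name__ == "__main__":
    print(compare_revisions(49741234, 49741235)[:200])  # hypothetical rev IDs
```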
[14:05:20] nope [14:05:32] we had to abort that [14:05:35] "we failed over the ores database for kernel upgrade" [14:05:40] Ok so there was going to be [14:05:45] Did it start? [14:05:46] yes [14:05:56] it ? [14:05:58] OK. I'll see what I can figure out [14:06:10] Looks like eqiad was hit hard but codfw was OK [14:06:15] I 've done that process btw quite a few times [14:06:16] The kernel upgrade [14:06:28] the kernel upgrades was never done on the master [14:06:37] Did it start? [14:06:42] the slave had been done 1 hour earlier [14:06:57] OK and it looks like the problem started after the slave rebooted? [14:07:17] more like after we switched over to the slave [14:07:29] the slave had rebooted, 30 mins + had passed [14:07:32] everything was looking peachy [14:07:42] then we switched over the DNS for oresrdb.svc.eqiad.wmnet [14:07:53] and then 503s started being emitted from uwsgi [14:08:03] switching over the DNS? [14:08:07] we failed over immediately and this subsided and returned to normal [14:08:19] yes oresdb.svc.eqiad.wmnet is a CNAME to oresrdb1001 [14:08:28] So you switched the slave to the master? [14:08:29] when we want to do maintenance on oresrdb1001 [14:08:36] we switch to orerdb1002 [14:08:41] oresrdb1002 [14:08:44] OK I think I understand now. [14:08:56] We 've done it quite a few times already up to now [14:09:02] In ORES? [14:09:06] yup [14:09:08] kk [14:09:27] So we expect connections to follow the DNS change without getting a signal? [14:09:54] no, sending the signal is easy [14:10:16] in fact we were planning on waiting out for connections to normally terminate this time around [14:10:30] What is the signal? [14:10:31] usually I just reload uwsgi+celery [14:10:36] Oh gotcha. [14:10:38] SIGHUP [14:10:42] So that's something that is new this time. [14:11:08] we 've been trying to reproduce it a bit [14:11:15] we 've switched just scb1001 [14:11:20] nothing bad happened [14:11:36] orerdb1002 is still in the same state that it was? [14:11:37] a few 503s right after the uwsgi restart but nothing more [14:11:43] yes [14:11:57] OK so maybe I'll inspect redis on there [14:12:16] per https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=now-6h&to=now-1m at least [14:12:22] number of keys hasn't changed [14:12:31] memory usage is the same [14:13:11] the drop and then rise in the redis connected clients graph is us doing the failover and failback [14:13:59] Looks like I can't ssh to oresrdb1002.eqiad.wmnet [14:14:21] quite possibly you 've never had to up to now [14:14:38] OK. I should be able to connect to it from scb1001 [14:14:46] yeah I was going to suggest that [14:17:45] OK so... this looks like it is working. [14:17:51] I wonder if celery entered a bad state. [14:18:11] We were hoping to do a deployment today that would include new signal-based timeouts. [14:18:13] it got out of it pretty easily on the DNS revert if it did [14:19:18] yeah I 've seen that. It would be useful indeed [14:19:27] especially in cases like that regexp mess [14:19:38] Well, celery *is* the broker/backend it is connected to. No other state. [14:21:17] Maybe there was some blocking operation that failed during the DNS switch. [14:21:41] I'm not familiar enough with celery internals, but I imagine that accessing the main queue would be the blocker. [14:22:29] You'd think that redis wouldn't replicate in a way that'd allow that... 
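A small sketch of the failure mode being discussed here: an established connection keeps talking to whatever the oresrdb.svc.eqiad.wmnet CNAME resolved to at connect time, and only a reconnect (which is what the SIGHUP / reload of uwsgi and celery forces) follows the new record. The check itself is illustrative, not part of the production tooling.

```python
# Sketch: compare what DNS says now with what an already-open redis-py
# connection is actually talking to. Repointing the CNAME changes the first,
# not the second, until the process reconnects.
import socket
import redis  # redis-py, assumed to be what the workers use

SERVICE_NAME = "oresrdb.svc.eqiad.wmnet"

def dns_target():
    """What the service name resolves to right now."""
    return socket.gethostbyname(SERVICE_NAME)

def connected_peer(client):
    """The IP an established connection is talking to.

    Reaches into redis-py's connection pool (internal _sock attribute)
    purely for illustration.
    """
    connection = client.connection_pool.get_connection("PING")
    try:
        connection.connect()
        return connection._sock.getpeername()[0]
    finally:
        client.connection_pool.release(connection)

if __name__ == "__main__":
    client = redis.StrictRedis(host=SERVICE_NAME, port=6379)
    print("DNS currently points at:", dns_target())
    print("client is connected to: ", connected_peer(client))
    # After the CNAME flip, the first line changes immediately; the second
    # only changes after client.connection_pool.disconnect() or a reload
    # re-resolves the name.
```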
[14:23:00] don't assume too much from redis replication [14:23:08] we 've seen it cause issues a few times already [14:23:18] kk [14:23:19] the worst is when using lua scripting [14:23:35] So maybe we need to update our own protocol for these sorts of things. [14:24:02] actually one thing that would solve it very nicely would be to support nutcracker [14:24:14] or twemproxy as it is called these days [14:24:33] but that is blocked due to ORES using the MULTI command [14:24:36] akosiaris, what's going on with the weird cache hit rate spikes in codfw? [14:24:47] Don't let me interrupt, but halfak: is there still some beta deploying and testing to do this morning? [14:24:53] halfak: I was wondering about that too and 've been meening to ask you [14:25:15] they seem to be always happening [14:25:26] awight, right now, I'm trying to figure out the status and cause of an hour long outage we recovered from recently. [14:25:27] for example https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=now-7d&to=now-1m [14:26:06] Well this is all deeply concerning [14:26:32] halfak: very nice. k I'll try to do beta before commuting. [14:26:49] ok [14:27:48] so, if ORES could support nutcracker/twemproxy the above would not even happen [14:27:56] we wouldn't even have to switchover stuff [14:28:10] both databases would be used, content would be sharded [14:28:17] there would be no master/slave [14:28:27] akosiaris, I'm far more concerned with thinking about what's going on with codfw's cache right now [14:28:39] and we would be free to do maintenance on hosts without problems [14:28:42] We just had eqiad's cache explode and codfw is behaving badly [14:29:03] Maybe we could solve that problem before I start digging through celery internals for an explanation for why it needs multi. [14:29:19] Which will prevent us from using tewmproxy apparently. [14:29:39] yeah codfw's cache hit rate has been like that for ever [14:29:51] When did you notice this? [14:29:59] about a few hours ago [14:30:09] kk [14:30:11] while I was debugging the eqiad stuff [14:30:38] * awight bookmarks akosiaris's suggestion to read and understand this afternoon [14:35:26] looks like changeprop isn't hitting codfw now. [14:35:33] Or it is doing so in a weird way. [14:36:02] Totally failing to share a graphite graph [14:36:16] What does clicking on "Short URL" even do? [14:37:15] https://grafana.wikimedia.org/dashboard/db/ores?panelId=4&fullscreen&orgId=1&from=now-30d&to=now-1m [14:37:23] Meh. Just look at codfw in this graph [14:37:42] PROBLEM - https://grafana.wikimedia.org/dashboard/db/ores grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/ores is alerting: 5xx rate (Change prop) alert. [14:38:08] Here we go again [14:38:47] akosiaris, ^ was that you? [14:39:22] could be, I reloaded uwsgi and restarted celery on scb1001 [14:39:36] I can't see why it would croak like that though tbh [14:39:46] Any 500s will trigger that [14:39:57] (more than 5 per minute or something like that [14:40:19] I tried not to cause anything.. I 've actually just reloaded uwsgi [14:40:26] instead of restarting it [14:40:27] So we can tell that uwsgi was blocking on celery because of this: https://grafana.wikimedia.org/dashboard/db/ores?panelId=13&fullscreen&orgId=1&from=now-24h&to=now-1m [14:40:36] (during the outage) [14:40:58] ah, I know what happened [14:41:02] I just put 2+2 together [14:41:11] sigh... 
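For context on the nutcracker/twemproxy blocker mentioned above, a minimal sketch of a Redis MULTI/EXEC transaction as issued through redis-py; twemproxy shards keys across instances and cannot honor a transaction whose keys may hash to different shards. The keys below are illustrative only.

```python
# Sketch: a Redis MULTI/EXEC transaction through redis-py. Per the
# discussion above, the MULTI usage that blocks twemproxy comes from the
# celery side rather than the ORES cache itself.
import redis

client = redis.StrictRedis(host="localhost", port=6379)

# pipeline(transaction=True) wraps the queued commands in MULTI ... EXEC.
with client.pipeline(transaction=True) as pipe:
    pipe.set("ores:example:task-state", "SUCCESS")        # illustrative keys
    pipe.rpush("ores:example:result-list", "0.92")
    pipe.expire("ores:example:result-list", 3600)
    results = pipe.execute()  # sends MULTI, the queued commands, then EXEC

print(results)  # e.g. [True, 1, True]
```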
[14:41:14] \o/ [14:41:15] so here's my theory [14:41:17] knowing is so awesome [14:41:18] kk [14:41:25] DNS change goes live [14:41:25] * halfak is glued to screen [14:41:39] some uwsgi processes pick up the change, but not all [14:41:52] no celery processes pick up the change through [14:42:08] so the "new" uwsgi processes try to get stuff from oresrdb1002 [14:42:17] but no celery worker is there to service them [14:42:23] and they are blocked on it [14:42:58] celery queue fills up and everyone gets sad and the 503s start [14:42:59] when some celery processes picked up the change, they were probably too few to service all the new uwsgi processes [14:43:06] exactly [14:43:18] * awight mops brow--I just attempted production deployment, luckily was stopped by an auth failure. [14:43:26] beta deployment is tin on *wmflabs* [14:43:33] nasty... [14:43:39] so, this is easy to avoid next time around [14:43:42] please copy-paste commands from wiki page :P [14:43:46] the moment the change goes live [14:43:47] akosiaris, oh? [14:43:50] I force a reload [14:44:25] when/if we get to support nutcracker/twemproxy this won't even be needed [14:45:05] akosiaris, I'm guessing that's not going to happen unless we want to monkey patch celery to not use multi -- but there's likely a good reason it does. [14:45:28] yeah I 've recently cloned it and started looking at the code [14:45:30] * halfak double checks out simple cache access for any multi-smells [14:45:48] then got distracted by who knows what [14:45:54] need to have a look at it again [14:45:57] lol. [14:46:13] in our cache we call redis.get() and redis.setex() -- that's it. [14:47:12] hmm so it must be celery then [14:47:21] yeah [14:48:32] I'm seeing the same error that I got on production tin: [14:48:33] > 14:47:54 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'worker', 'fetch', '--refresh-config'] on deployment-sca03.deployment-prep.eqiad.wmflabs returned [255]: Agent admitted failure to sign using the key. [14:48:50] No idea what that means. [14:48:55] I've not seen that before. [14:49:14] I'll ask in -releng [14:49:18] But FWIW, I'm stuck on a key error for pulling down the draftquality repo [14:49:34] Which doesn't make any sense because phab ssh keys are a global thing for a user. [14:49:41] Can pull down the rest of the repos. [14:49:43] RECOVERY - https://grafana.wikimedia.org/dashboard/db/ores grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/ores is not alerting. [14:49:50] thank icinga-wm [14:49:52] *thanks [15:00:43] akosiaris, thanks for thinking this through me today. I'll keep an eye on the codfw cache stuff. [15:01:09] halfak: thanks as well [15:01:32] I still don't feel settled about what is causing the issue :/ [15:01:43] btw the codfw cache stuff might even make sense [15:02:02] it's only receiving changeprop + some external queries [15:02:05] Right. been trying to check out the access patterns and rate of Precaching [15:02:23] and not the mediawiki originated requests [15:03:17] akosiaris, gotcha. That'll make a difference too. [15:03:46] Also, there's a bit of funny timing that might be at play. If a request gets deduped for a celery result, that still is marked as a cache miss because it might wait a bit. [15:03:56] 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3448328 (10awight) Staging deployment complete, and seems to work. 
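A minimal sketch of the cache access pattern described above ("in our cache we call redis.get() and redis.setex() -- that's it"); the key layout and TTL are assumptions for illustration, not the production schema.

```python
# Sketch of a read-through score cache built only on GET and SETEX, the two
# calls the ORES cache is said to use. Key format and TTL are made up.
import json
import redis

client = redis.StrictRedis(host="localhost", port=6379)
TTL_SECONDS = 7 * 24 * 3600  # assumed; not the production value

def cached_score(wiki, model, rev_id, compute_score):
    """Return a cached score, computing and caching it on a miss."""
    key = f"ores:{wiki}:{model}:{rev_id}"  # hypothetical key layout
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit
    score = compute_score(rev_id)          # cache miss: do the real work
    client.setex(key, TTL_SECONDS, json.dumps(score))
    return score

if __name__ == "__main__":
    fake_scorer = lambda rev_id: {"damaging": {"probability": {"true": 0.07}}}
    print(cached_score("enwiki", "damaging", 123456, fake_scorer))
```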
Beta deployment complete, needs smoke testing. [15:04:06] halfak: fyi ^ [15:04:16] Thanks awight [15:04:26] Amir1 is our master of smoke testing beta [15:04:37] We should have some docs on checking if the beta extension deployment is happy. [15:06:54] Amir1: If you have time to ^ smoke test beta, lmk! [15:07:00] * halfak is now officially in quarterly review mode [15:07:09] * awight facepalms [15:07:15] but will continue to lightly monitor here [15:08:27] Beta is releasing the magic smoke. [15:15:12] --Grumble-- I can't even find which instance corresponds to ores-beta.wmflabs.org [15:15:39] ... in neither the ores nor ores-staging projects! [15:15:56] ores-beta is in the beta project [15:16:04] :) k [15:16:11] See the scap files [15:16:14] in the repo [15:16:17] for which machines [15:16:50] I don't see that project... https://tools.wmflabs.org/openstack-browser/project/ [15:16:55] k will read the scap files. [15:18:23] I'm in [15:21:46] nice [15:23:19] ping it [15:23:23] or host [15:24:24] or https://tools.wmflabs.org/openstack-browser/proxy/ [15:25:00] > [2017-07-18T15:24:38] enchant.errors.DictNotFoundError: Dictionary for language 'el' could not be found [15:25:01] deployment-sca03.deployment-prep.eqiad.wmflabs [15:25:07] awight halfak ^^ [15:25:15] deployment-prep ores-beta.wmflabs.org http://10.68.21.183:8081 deployment-sca03.deployment-prep.eqiad.wmflabs [15:25:16] looking [15:25:29] paladox: Yes! Sorry, I did manage to find the box, that is the one. [15:25:35] ok :) [15:25:50] https://github.com/wikimedia/puppet/blob/d6592318263083a9db135133ac04dbf31c110891/modules/ores/manifests/base.pp#L22 [15:25:54] paladox: Thanks again for getting me on the right track with the Phabricator key [15:25:57] Is puppet running on that machine? [15:26:04] your welcome :) [15:26:26] halfak: checking [15:30:25] Failed to apply catalog: Found 1 dependency cycle: [15:30:26] (Exec[recommendation_api config deploy] => Service::Node::Config::Scap3[recommendation_api] => Scap::Target[recommendation-api/deploy] => User[deploy-service] => Exec[recommendation_api config deploy]) [15:36:13] Looks like someone else broke puppet? [15:36:28] ah you're right, that's wdqs [15:36:40] will mention in ops [15:37:03] PROBLEM - https://grafana.wikimedia.org/dashboard/db/ores grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/ores is alerting: 5xx rate (Change prop) alert. [15:37:54] that one ^ is not me [15:38:03] We're not overloaded [15:38:23] Hmm... We should see a drop in 500s with the next deploy [15:38:35] Right now, we 500 on a param error or something that should 404 [15:38:37] :/ [15:38:58] :( [15:39:28] We have fixes for those if we can get past this puppet issue and deploy in a couple hours. [15:40:13] RECOVERY - https://grafana.wikimedia.org/dashboard/db/ores grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/ores is not alerting. [15:42:39] Actually... neither of those issues should affect this precaching issue. [15:42:47] I'd like to see what 500's precaching is getting [15:42:58] *changeprop [15:43:47] At this point, I need to commute, but I could potentially * install the package on beta manually, and * confirm whether puppet installed the package on production already [15:44:03] Go commute. [15:44:08] I'll see what I can do. [15:44:13] Not presenting in QR anymore [15:44:18] paladox: That machine you found... if you get the chance, we could use a puppet freshness alert for it. 
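A small sketch for catching the DictNotFoundError seen on deployment-sca03 above before a deploy rather than during one: pyenchant can report whether the Greek (and other) dictionaries are installed. The language list here is an assumption.

```python
# Sketch: verify the aspell/hunspell dictionaries that the language features
# need are installed, to catch errors like
# "DictNotFoundError: Dictionary for language 'el' could not be found" early.
# The language list is illustrative.
import enchant

REQUIRED_LANGS = ["el", "en", "fi", "tr"]  # assumed set of languages

def missing_dictionaries(langs=REQUIRED_LANGS):
    """Return the languages for which enchant cannot find a dictionary."""
    return [lang for lang in langs if not enchant.dict_exists(lang)]

if __name__ == "__main__":
    missing = missing_dictionaries()
    if missing:
        # On Debian/Ubuntu the Greek dictionary comes from the aspell-el
        # package, which is what ended up being installed by hand.
        raise SystemExit(f"Missing enchant dictionaries: {', '.join(missing)}")
    print("All required dictionaries are installed.")
```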
[15:44:21] aha [15:44:25] tag, thanks [15:44:37] is that in deployment-prep? [15:44:45] I think so, yeah [15:44:58] that's releng. They said no to zppix request for icinga2 monotoring anyways. [15:45:08] maybe you want to do it for that instance only [15:45:09] rats [15:45:21] I'll see if we can get icinga1 on that, then [15:45:22] awight though we could possibly do it for this instance [15:45:34] editing the hiera for that instance only [15:45:49] awight icinga1 does not monotore labs [15:46:31] awight add this "profile::base::nrpe_allowed_hosts": 127.0.0.1,10.68.23.211 to that hiera file for [15:46:41] deployment-sca03.deployment-prep.eqiad.wmflabs [15:48:57] on the road for an hour... [15:49:13] PROBLEM - https://grafana.wikimedia.org/dashboard/db/ores grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/ores is alerting: 5xx rate (Change prop) alert. [15:49:21] WTF [15:49:51] Just switches from eqiad to codfw [15:51:15] halfak: got a minute? [15:51:23] Sorry not right now. [15:51:34] In a meeting and working on a deployment and working on production issues :) [15:51:45] ok, will ask later :) [15:55:04] o/ halfak [15:55:38] I don't know where these 500s are coming from. [15:55:41] All looks fine. [15:55:54] I can even re-run the request that apparently returned a 500 [15:56:17] 500s? ores again? [15:56:49] Not an "again" situation [15:57:16] whats going on now? [15:57:27] It' [15:57:40] Sorry to not want to talk about it while I'm looking into it [15:59:31] halfak no worries let me know when you can :) [16:00:23] RECOVERY - https://grafana.wikimedia.org/dashboard/db/ores grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/ores is not alerting. [16:02:02] Amir1 when you get a chance can you help me figure out how to fix https://github.com/wiki-ai/wikilabels/pull/194 [16:27:04] OK beta deploy is fine with puppet work around. [16:30:04] 10Scoring-platform-team-Backlog, 10ORES: ORES deployment finish "successfully" even when uwsgi and celery fail to successfully start up - https://phabricator.wikimedia.org/T170950#3448625 (10Halfak) [16:57:29] halfak: I see some backscroll, what was the puppet workaround? [16:57:41] to the point--will it work on production? [17:00:34] back in 5 min. [17:07:20] halfak: I'm in the dark... Holler when you're back. [17:08:19] Amir1: Do you know what we're doing about the deployment today? I can't tell if we're on track or not, due to some puppet unpleasantness that might prevent the prod boxes from installing the Greek dictionary. [17:10:19] awight: halfak I was afk for dinner [17:10:26] scrolling back [17:10:33] That's the right way to do dinner :) [17:11:18] :))) Not the most efficient way for Wikipedia, My preferred one is work with one hand and eat with the other [17:11:31] :-[ [17:11:45] Just don't get confused, I guess [17:11:46] awight, hey [17:11:59] Sorry stepped away for lunch and it took longer than expected. [17:12:01] * awight bites the mouse [17:12:10] My work around was "sudo apt-get install aspell-el" [17:12:22] halfak: That's an appropriate stress response :-p [17:12:22] and we need to makes sure that puppet is running on scb nodes. [17:12:56] halfak: yeeah are you planning to do that nonsense on production, though? 
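In the spirit of T170950 above (and of "I can even re-run the request that apparently returned a 500"), a minimal post-deploy smoke check sketch; the root URL matches what the icinga check in this log polls, while the scores path and revision ID are assumptions.

```python
# Sketch: a tiny smoke check so a deploy does not report success while
# uwsgi/celery are failing to serve (the problem filed as T170950).
# The /v3 scores path and revision ID are assumptions for illustration.
import sys
import requests

CHECKS = [
    "https://ores.wikimedia.org/",
    "https://ores.wikimedia.org/v3/scores/enwiki/123456/damaging",  # assumed path
]

def smoke_test(urls=CHECKS, timeout=10):
    failures = []
    for url in urls:
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code != 200:
                failures.append(f"{url} -> HTTP {response.status_code}")
        except requests.RequestException as error:
            failures.append(f"{url} -> {error}")
    return failures

if __name__ == "__main__":
    problems = smoke_test()
    for problem in problems:
        print(problem)
    sys.exit(1 if problems else 0)
```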
lemme check what their status is currently [17:13:12] awight: Okay, first, yes it's sca03 we should note that somewhere though [17:13:39] * awight squints hard [17:13:44] the workaround is to install it, when puppet is back up, it just skips and nothing breaks \o/ [17:13:58] awight, if puppet is running we won't have to [17:14:19] halfak: I think puppet in beta is broken [17:14:29] right [17:14:41] Amir1: agreed, wdqs recommendation_api has a dependency cycle [17:14:42] to deploy in beta, the only thing we should do is to go to deployment-tin and then do scap deploy [17:14:44] from scb1001: The last Puppet run was at Tue Jul 18 17:10:03 UTC 2017 (4 minutes ago). [17:15:34] http://tyler.zone/changeprop-cycle.png [17:16:09] Amir1, we did that. Puppet was broken so we were missing an apt package. [17:16:22] Verified that aspell-el is installed on scb1002 [17:16:27] I manually installed the same package that apt calls for and everything was fine [17:16:28] nice [17:16:32] So shall we go ahead with production deployment? [17:16:33] I think we are good to go. [17:16:39] k. I'll drive then? [17:16:39] Yes. [17:16:44] Yes plz. :) [17:16:47] great [17:16:48] halfak: did you do that in prod cluster? [17:17:00] Amir1, do what? [17:17:12] "I manually installed the same package that apt calls for and everything was fine" [17:17:18] puppet isn't broken in prod [17:17:25] yeah, I got confused [17:17:28] see awights comment about aspell-el [17:17:30] kk [17:17:44] awight, https://wikitech.wikimedia.org/wiki/ORES/Deployment#Production_cluster_.28ores.wikimedia.org.29 [17:17:50] Want to get on a call just in case? [17:18:07] sure [17:18:28] https://hangouts.google.com/hangouts/_/wikimedia.org/ores [17:18:44] * halfak is there [17:19:52] I need to work on wikidata atm, :(((( [17:20:03] no worries, Amir1 [17:20:18] halfak: I'm around if anything goes wrong [17:20:40] even though I don't have any rights to do anything but still I can help with some stuff [17:26:26] halfak: https://el.wikipedia.org/wiki/%CE%A0%CF%8D%CE%BB%CE%B7:%CE%9A%CF%8D%CF%81%CE%B9%CE%B1?uselang=en [17:30:26] Canary is good. Moving forward [17:31:32] 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3448916 (10awight) Beta works. Production deployment is in progress, the canary checked out OK. [17:31:38] 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3448917 (10Halfak) a:03awight [17:31:51] ^ since awight is doing the deploy, should probably be assigned to him [17:32:08] * awight adjusts albatross [17:32:31] 10Scoring-platform-team, 10ORES: Update revscoring to 1.3.17 in wheels repo - https://phabricator.wikimedia.org/T170713#3448919 (10Halfak) 05Open>03Resolved [17:32:48] Damn looks like codezee is gone [17:32:56] first chance I had to actually work with him [17:34:43] time zone issues [17:35:52] yeah :/ [17:39:27] halfak: I added three more features to wikidata model and roc-auc went down from 99% to 89% how that's possible [17:39:50] Amir1, wow that's surprising. Did you re-tune? [17:40:08] nope but I thought it should work either way [17:40:09] Exact same dataset? [17:40:13] let me re-tune [17:40:33] btw, did we ever re-train the fawiki model on new data? 
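A minimal sketch of what re-tuning against cross-validated ROC AUC looks like after the feature set changes, as discussed above; this is plain scikit-learn standing in for revscoring's tune utility, with a placeholder parameter grid and synthetic data.

```python
# Sketch: re-tune hyperparameters with cross-validated ROC AUC after adding
# features, rather than reusing the grid chosen for the old feature set.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the labeled edits.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)

param_grid = {                      # placeholder grid
    "n_estimators": [300, 500, 700],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",   # cross-validated, scored on predicted probabilities
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print("best params:", search.best_params_)
print("best cross-validated roc_auc:", round(search.best_score_, 4))
```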
[17:40:49] nope [17:40:53] we should do it [17:40:55] someday :D [17:41:44] Should be easy :) [17:42:00] I keep forgetting it, let me make a card [17:43:08] 10Scoring-platform-team, 10editquality-modeling, 10artificial-intelligence: Add new data for damaging models of Persian Wikipedia - https://phabricator.wikimedia.org/T170960#3448976 (10AnotherLadsgroup) [17:43:24] Do you all get email about icinga-wm alerts? [17:43:35] I need to += CC [17:44:56] Yeah. There's a list. I know where it is referenced. [17:45:34] https://github.com/wikimedia/puppet/blob/a7189380d06718a6c68bad2e2127c7da9d369ebb/modules/nagios_common/files/contactgroups.cfg#L62 [17:45:51] Actually, I'm seeing that you're both being notified by email. Can we just point this sort of thing to our private email addy? [17:45:52] Looks like we need to add awight there [17:45:56] Want to start patchset? [17:46:09] awight, hmm... not a bad idea. [17:46:18] I will en-patchset as the list, then [17:46:41] Amir1 is getting SMS which is saucy. I'll leave that one alone. [17:47:16] I don't get sms, the service was super expansive :D [17:47:34] https://www.irccloud.com/pastebin/defCSOx4/ [17:48:18] weiiiird [17:48:23] halfak: I have a feeling that the data has been changed over the years [17:48:31] Cna you try to do the same tuning with old features? [17:48:32] yeah! [17:48:32] probably pages got deleted, etc. [17:48:48] double check that old stat ;) [17:48:57] yeah. That's my plan but first let me check if the current data is okay [17:50:07] kk :) [17:50:12] the data looks sane [17:50:16] I'm skeptical that all that change is just deletions. it will be good to know. [17:50:30] PROBLEM - https://grafana.wikimedia.org/dashboard/db/ores grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/ores is alerting: 5xx rate (Change prop) alert. [17:50:56] ^ might be deployment related. [17:51:13] it's restarting services at the moment [17:51:54] makes sense then [17:52:30] Just realized we are due update ChangeProp to use the new endpoint [17:53:00] How does that interact with this deployment? [17:54:22] halfak: hmm, I think the lower roc-auc is related to CV validation [17:54:37] because on the old features it's around 87% [17:55:13] tuning atm [17:55:43] Scap is complete. [17:58:44] Hmm... That *shouldn't* be right, but I guess it must be. [17:58:54] The training process does cross-validation too [17:59:00] Both tuning and cv_train [17:59:09] awight, nice! [17:59:11] https://www.irccloud.com/pastebin/aJER2kgL/ [17:59:29] I wonder if it is doing something weird with roc_auc [18:00:01] Anyway, it's improving roc-auc by 0.3% :D [18:03:09] :D That's good to know. Do we have an old tuning report with better roc_auc scores? [18:03:39] RECOVERY - https://grafana.wikimedia.org/dashboard/db/ores grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/ores is not alerting. [18:06:30] 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3432972 (10awight) 05Open>03Resolved [18:07:47] halfak: I think it should be up in the repo (git history even if we merge this one) [18:08:08] speaking of git history, let me make a phab card for splitting the repo [18:08:19] awight, want to write a quick announcement for ai|wikitech@lists? [18:08:30] If you make it in an etherpad, I'll make some edits and add links [18:09:19] halfak: sure, lessee... 
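A small sketch for the "probably pages got deleted" theory above: the MediaWiki API reports labeled revision IDs it can no longer find under `badrevids`, which gives a quick measure of how much an old training set has drifted. The wiki and revision IDs are illustrative.

```python
# Sketch: check how many labeled revisions can still be fetched, as a rough
# gauge of dataset drift (deleted pages, suppressed revisions, ...).
import requests

API_URL = "https://www.wikidata.org/w/api.php"

def missing_revisions(rev_ids):
    """Return the subset of rev_ids the API reports as bad/missing."""
    missing = set()
    for start in range(0, len(rev_ids), 50):        # API limit: 50 revids/request
        batch = rev_ids[start:start + 50]
        params = {
            "action": "query",
            "revids": "|".join(str(r) for r in batch),
            "prop": "revisions",
            "format": "json",
        }
        data = requests.get(API_URL, params=params, timeout=10).json()
        bad = data.get("query", {}).get("badrevids", {})
        missing.update(int(rev) for rev in bad)
    return missing

if __name__ == "__main__":
    labeled = [123456789, 123456790, 123456791]      # hypothetical labeled rev_ids
    gone = missing_revisions(labeled)
    print(f"{len(gone)} of {len(labeled)} labeled revisions are no longer available")
```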
[18:11:08] https://etherpad.wikimedia.org/p/ores-deploy [18:11:48] 10Scoring-platform-team: Scoring platform team FY18 Q1 - https://phabricator.wikimedia.org/T166045#3449091 (10awight) [18:11:50] 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3449090 (10awight) [18:11:52] 10Scoring-platform-team-Backlog, 10ORES, 10editquality-modeling, 10Tamil-Sites, 10artificial-intelligence: Deploy reverted model for tawiki - https://phabricator.wikimedia.org/T166048#3449088 (10awight) 05Open>03Resolved a:03awight [18:12:00] 10Scoring-platform-team, 10ORES, 10articlequality-modeling, 10artificial-intelligence: Deploy new trwiki article quality model - https://phabricator.wikimedia.org/T170838#3449092 (10awight) 05Open>03Resolved a:03awight [18:12:02] 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10artificial-intelligence: ORES deployment - Mid July, 2017 - https://phabricator.wikimedia.org/T170485#3432972 (10awight) [18:15:44] 10Scoring-platform-team, 10Wikidata, 10editquality-modeling, 10User-Ladsgroup, and 2 others: Add entropy-related and uppercase-related measures to comments - https://phabricator.wikimedia.org/T170835#3449107 (10AnotherLadsgroup) Old model without these four features: ``` ScikitLearnClassifier - type: Grad... [18:16:20] Amir1, are these "damaging" models? [18:16:29] yup [18:17:26] Gotcha. [18:18:04] That roc-auc is *super different* [18:18:05] WTF [18:21:13] Amir1, looks like we get a different score from https://github.com/wiki-ai/revscoring/blob/1.3.x/revscoring/utilities/tune.py#L247 [18:21:31] Than we do from https://github.com/wiki-ai/revscoring/blob/1.3.x/revscoring/scorer_models/test_statistics/roc.py#L36 [18:22:18] 10Scoring-platform-team, 10Wikidata, 10editquality-modeling, 10User-Ladsgroup, and 2 others: Add entropy-related and uppercase-related measures to comments - https://phabricator.wikimedia.org/T170835#3449184 (10AnotherLadsgroup) https://github.com/wiki-ai/editquality/pull/83 [18:23:34] halfak: nudge to add Amir1 and me as scoring-internal admins [18:24:30] awight, woops [18:26:58] not quite sure how to do that awight [18:27:02] * halfak looks at mailman pages [18:28:43] you add each of us separated by commas, iirc [18:28:51] halfak: ^ [18:28:58] There's no "put the admins here" box [18:29:11] Oh wait. Yes there is [18:29:20] Not under "Membership management" though ;) [18:29:55] ah, newlines not commas [18:29:58] Aha! Commas are bad in this one [18:29:59] yeah [18:30:01] lol [18:30:04] {{done}} [18:30:04] How efficient, halfak! [18:30:11] * halfak brofists AsimovBot [18:30:38] /o\ [18:30:48] Amir1, what are the stats you get from that damaging model when you use cv_train? [18:31:31] halfak: https://phabricator.wikimedia.org/T170835#3449107 [18:31:39] these are revscoring cv_train [18:31:57] Oh yeah... so even cvtrain has the different statistic [18:32:12] halfak: Here's the stuff I could reverse-engineer, https://etherpad.wikimedia.org/p/ores-deploy [18:32:13] Amir1, they are roughly the same between tune and cv_train, right? [18:34:31] 10Scoring-platform-team, 10editquality-modeling, 10artificial-intelligence: Split editquality repo to two repos, one with full history, one shallow - https://phabricator.wikimedia.org/T170967#3449254 (10AnotherLadsgroup) [18:34:37] halfak: yeah [18:35:00] Amir1, weird! 
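The log does not establish why tune.py and roc.py disagree, but one common way two code paths report very different ROC AUC for the same model is scoring hard predictions in one place and probabilities in the other; the sketch below only illustrates that effect on synthetic data and makes no claim about revscoring internals.

```python
# Sketch: roc_auc_score() fed predict() labels vs. predict_proba()
# probabilities gives different numbers for the same model, especially on
# imbalanced data. Purely illustrative; not a diagnosis of tune.py/roc.py.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

auc_from_probabilities = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
auc_from_labels = roc_auc_score(y_test, model.predict(X_test))  # thresholded at 0.5

print("roc_auc on probabilities:", round(auc_from_probabilities, 3))
print("roc_auc on hard labels:  ", round(auc_from_labels, 3))
# The second number is typically much lower, even though it is "the same
# model" and nominally "the same metric".
```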
[18:35:03] So weird [18:35:04] ok [18:35:06] wtf [18:35:07] ok [18:35:09] >:( [18:35:21] lol I want to try with revscoring 2.0 now and see what happens. [18:35:23] lol [18:35:32] * halfak has a brain-splosion [18:35:45] brb need a break before meetings start again [18:37:05] I need to increase blood sugar, also relocating so this might take an hour. [18:38:45] 10Scoring-platform-team, 10Wikidata, 10editquality-modeling, 10User-Ladsgroup, and 2 others: Add entropy-related and uppercase-related measures to comments - https://phabricator.wikimedia.org/T170835#3449279 (10AnotherLadsgroup) my pip freeze in case the numbers are weird: ``` (p3)ladsgroup@ores-misc-01:~/... [18:40:40] halfak: btw. When you have some time please review the eslint stuff, I want to finish it [18:40:42] :D [18:46:35] Amir1 i need help with codecov support for wikilables https://github.com/wiki-ai/wikilabels/pull/194 [18:47:03] Zppix: well, it's not needed as we are not running any nosetests (or tox) [18:47:17] we practically are not testing anything [18:47:23] Amir1 i have setup pytests [18:48:09] https://github.com/wiki-ai/wikilabels/pull/194/commits/126a6194c904ad368621fe0e9519211a1edb3aa0 [18:48:32] pytests should have tests for it, right? [18:48:40] we don't have any for them [18:48:45] idk [18:49:00] running it without writing one is completely useless [18:49:24] it's like running an empty python file, it works but how useful? [18:49:49] the example repo that travis ci provides doesnt have anything for pytests [18:49:58] it just has in the readme to run a cmd [18:50:39] because it's an example repo ;) [18:50:50] well crap [19:05:35] qls [19:10:09] Hey folks, I'm stepping away from my desk for a bit. I'm available via email and gchat for the next few hours. [19:16:20] Do you agree this edit should belong to reverted&damaging category: "In this book, Vanini speculated that humans might have originated from monkeys. It was too much for these obscurantist times" [19:17:08] I find it smart) [19:51:50] fajne unless its a quote i would say that last sentence 100% does [20:07:27] 10Scoring-platform-team, 10editquality-modeling, 10User-Ladsgroup, 10artificial-intelligence: Flagged revs approve model to fiwiki - https://phabricator.wikimedia.org/T166235#3449799 (10Zache) > As you said, the approved query is only taking the first diff, and there can be multiple edits per approval. I'm... [20:17:30] o/ awight hows your day been? [20:23:52] \o Zppix going strong so far, how was yours? [20:29:37] good [20:29:52] gotta love it when all the hosts just start flooding you with apt upgrade alerts :p [20:54:57] OK I'm back. [20:55:21] I had a missing puppy problem. I walked around whistling for her. I even sent out a message to the local neighborhood mailing list. [20:55:22] halfak welcome back [20:55:25] She was in the garage [20:55:35] Locked in there because she missed her opportunity to leave. [20:55:39] Poor puppy [20:56:08] always next time puppy [20:56:31] wut [20:57:20] nothing... [20:59:39] kk [20:59:53] halfak do you know if #invoke needs subst'd? [21:00:04] Not sure what you mean [21:00:27] like on enwiki [21:01:36] Sorry. Still not clear on that. Is it a parser function? [21:01:39] https://www.mediawiki.org/wiki/Help:Extension:ParserFunctions [21:02:18] ill just leave it for someone smarter :P [21:02:19] then me [21:03:02] awight, I'm done with https://etherpad.wikimedia.org/p/ores-deploy if you want to post [21:03:23] :p [22:03:15] lol "foreseeable" triggers my "ores" highlighter. 
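A minimal sketch of what an actual test would give the pytest/codecov plumbing discussed for wikilabels PR #194 to measure; the helper under test is hypothetical, not real Wikilabels code.

```python
# tests/test_example.py -- a minimal sketch of a real test, since running
# pytest is only useful once test functions exist. The function under test
# is hypothetical, standing in for actual Wikilabels code.

def normalize_label(raw):
    """Hypothetical helper: trim and lowercase a label string."""
    return raw.strip().lower()

def test_normalize_label_strips_and_lowercases():
    assert normalize_label("  Damaging \n") == "damaging"

def test_normalize_label_leaves_clean_input_alone():
    assert normalize_label("goodfaith") == "goodfaith"
```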
Maybe I need word boundaries, although I enjoy the synchronicities [22:07:22] lol [22:47:41] * awight regrets missing the chance to have Amir1 sign off on this letter [23:46:20] 10Scoring-platform-team-Backlog, 10Technical-Debt: ores-wmflabs-deploy and ores production deploy repos should share a common ancestor - https://phabricator.wikimedia.org/T171014#3450804 (10awight)
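A tiny sketch of the word-boundary fix mentioned above for the "ores" highlighter matching inside "foreseeable".

```python
# Sketch: without word boundaries, "ores" matches inside "foreseeable" and
# "scores"; with \b anchors it only matches the standalone word.
import re

naive = re.compile(r"ores", re.IGNORECASE)
bounded = re.compile(r"\bores\b", re.IGNORECASE)

for text in ["foreseeable", "ORES is alerting", "scores"]:
    print(f"{text!r}: naive={bool(naive.search(text))} "
          f"bounded={bool(bounded.search(text))}")
# 'foreseeable': naive=True bounded=False
# 'ORES is alerting': naive=True bounded=True
# 'scores': naive=True bounded=False
```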