[11:43:06] o/
[12:45:51] Fun note: We send around 100 500 errors every 24 hours, it's not much but still better to be addressed
[12:50:02] ORES, Scoring-platform-team, User-Ladsgroup: Change default serializer of celery from pickle to json - https://phabricator.wikimedia.org/T206333 (Ladsgroup) We need to do this but it got deprioritized over very much needed operational improvements.
[13:25:45] ORES, Scoring-platform-team, Analytics, Dumps-Generation, and 3 others: [Epic] Make ORES scores available in Hadoop and as a dump - https://phabricator.wikimedia.org/T209611 (JAllemandou) When data gets stored in Hadoop, it is easy to supply pageviews-like dumps files. About how to compute the sc...
[13:29:43] ORES, Scoring-platform-team, Analytics: Choose HDFS paths and partitioning for ORES scores - https://phabricator.wikimedia.org/T209731 (JAllemandou) >>! In T209731#4754979, @Nuria wrote: > It is worth looking at already existing event data, if we want to reuse the logic that reads events and persists...
[13:31:13] ORES, Scoring-platform-team, Analytics: Purge ORES scores from Hadoop and begin backfill when model version changes - https://phabricator.wikimedia.org/T209742 (JAllemandou) Indeed in hadoop there is no such thing as 'in place'. The way to go could be to use model version as a partition-key. You'd ba...
[13:34:11] Amir1: I am merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474158/
[13:34:29] I see you +1ed it I guess the comment about "please do not merge this yet" does not apply anymore
[13:34:39] akosiaris: yes, thanks!
[13:34:43] One thing
[13:34:54] let's let it run one node
[13:35:01] make sure everything is alright
[13:35:10] then moving out to the next node
[13:35:53] easy enough
[13:38:57] started it, it should take a good 45-50 mins for codfw. Then I 'll proceed to eqiad
[13:41:11] akosiaris: Awesome, Thank you so much.
I might need to ask you for merging another patch sometime tomorrow
[13:44:39] akosiaris: btw. For when you have a little bit of time: 1- What do you think of adding another node to the redis cluster? (oresrdb[12]003) to make it be able to handle sentinel
[13:44:39] Buffer 12 is empty.
[13:44:48] Bad robot
[13:45:26] 2- you mentioned the next Q in the k8s thread, do you mean Jan-Mar 2019?
[13:50:00] (PS1) Ladsgroup: Add check for celery service in scap [services/ores/deploy] - https://gerrit.wikimedia.org/r/474690 (https://phabricator.wikimedia.org/T170950)
[13:53:37] Amir1: it might be doable I 'll have to take a look at the capacity of the VM clusters
[13:54:22] Thanks
[13:54:38] the reason is per the doc (https://redis.io/topics/sentinel): "You need at least three Sentinel instances for a robust deployment."
[13:55:50] (CR) Ladsgroup: "It would be great if RelEng people take a look this change in scap check scripts. Thank you!" [services/ores/deploy] - https://gerrit.wikimedia.org/r/474690 (https://phabricator.wikimedia.org/T170950) (owner: Ladsgroup)
[14:09:58] akosiaris: Sorry to bother you, Are you around for a deployment of ORES?
[14:10:22] Amir1: I would advise against it
[14:10:40] the puppet deployment is ongoing
[14:10:47] oh okay
[14:10:51] Noted
[14:11:14] and yes, Jan-Mar 2019 is the answer for k8s
[14:11:52] Cool, so it's not super far
[14:15:46] akosiaris: you meant for my patch? I checked it on codfw and it works fine
[14:16:11] I think we should proceed with eqiad for the puppet patch
[14:17:37] Logs are clean
[14:17:44] https://logstash.wikimedia.org/goto/2df81c42a76845aba50eae5ac808e55f
[14:17:59] * Amir1 dances a little for the logstash support
[14:34:57] o/
[14:35:05] Finally back at my desk :)
[14:35:12] Actually, my dinner table.
[14:38:05] Amir1: proceeding to eqiad now
[14:38:19] codfw seems fine indeed
[14:41:37] akosiaris: Thanks!
After it's done, we probably can deploy this
[14:42:52] halfak: I'm going to self merge a code in ores that would break things but going to re-add it ten minutes later (celery4 config changes)
[14:42:56] this time around it should be faster. I 've gone from 5 mins between hosts to 2
[14:43:18] Amir1, is this part of the config switch?
[14:43:24] yup
[14:43:31] OK. I'll stand by.
[14:44:16] akosiaris: I have this one as well but after the deployment of some code changes I'm making right now
[14:44:17] https://gerrit.wikimedia.org/r/c/operations/puppet/+/474694
[14:45:50] halfak: also the patch for scap restart is up
[14:46:03] "scap restart"?
[14:46:27] celery restart
[14:46:52] https://phabricator.wikimedia.org/T170950
[14:47:35] I love how small is number of our high priority tasks now: https://phabricator.wikimedia.org/project/board/1901/query/Q.gJEwm3HfLM/
[14:47:54] Oh. So it checks if the restart was successful?
[14:48:57] yup
[14:49:05] it's actually pretty easy
[14:50:03] * halfak looks for scap restart patch
[14:51:07] https://gerrit.wikimedia.org/r/#/c/mediawiki/services/ores/deploy/+/474690/
[14:52:10] (CR) Halfak: [C: 2] Add check for celery service in scap [services/ores/deploy] - https://gerrit.wikimedia.org/r/474690 (https://phabricator.wikimedia.org/T170950) (owner: Ladsgroup)
[14:52:24] (CR) Ladsgroup: [V: 2] Add check for celery service in scap [services/ores/deploy] - https://gerrit.wikimedia.org/r/474690 (https://phabricator.wikimedia.org/T170950) (owner: Ladsgroup)
[14:54:58] btw.
Travis is broken due to redis going crazy
[15:02:28] (PS1) Ladsgroup: Bump ORES to HEAD [services/ores/deploy] - https://gerrit.wikimedia.org/r/474703 (https://phabricator.wikimedia.org/T209587)
[15:04:59] ORES, Scoring-platform-team (Current), Patch-For-Review, User-Ladsgroup: Migrate ores celery configs to celery 4 - https://phabricator.wikimedia.org/T209587 (Ladsgroup) https://github.com/wikimedia/ores/pull/291
[15:05:39] (CR) Ladsgroup: [V: 2 C: 2] "Going to deploy this on beta" [services/ores/deploy] - https://gerrit.wikimedia.org/r/474703 (https://phabricator.wikimedia.org/T209587) (owner: Ladsgroup)
[15:07:51] wikimedia/ores#1145 (celery4_configs - efcf36d : Amir Sarabadani): The build failed. https://travis-ci.org/wikimedia/ores/builds/457008532
[15:10:28] ^ It's still the redis weirdness in travis
[15:16:07] akosiaris: the change is deployed on beta, works fine there.
[15:16:31] let's hammer beta!
[15:16:48] handled it well :)
[15:31:16] ORES, Scoring-platform-team (Current), Patch-For-Review, User-Ladsgroup: Upgrade celery to 4.1.0 for ORES - https://phabricator.wikimedia.org/T178441 (Ladsgroup)
[15:31:19] ORES, Scoring-platform-team (Current), Patch-For-Review, User-Ladsgroup: Migrate ores celery configs to celery 4 - https://phabricator.wikimedia.org/T209587 (Ladsgroup)
[15:32:55] ORES, Scoring-platform-team (Current), User-Ladsgroup: Travis is failing on master of ORES - https://phabricator.wikimedia.org/T209852 (Ladsgroup)
[15:47:25] okay, it's on prod
[15:47:33] there is no celery config there anymore
[15:58:57] Scoring-platform-team, Wikilabels, articlequality-modeling, artificial-intelligence: Build article quality model for Galician Wikipedia - https://phabricator.wikimedia.org/T201146 (Halfak) @Theklan and I just set up euwiki with a gadget entry in user preferences. I think that's probably the best...
[16:03:29] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (Halfak) +1. Note that this will affect ChangeProp/precaching too. We need to make sure that hidden models aren't...
[16:28:50] Amir1: btw I 'd like to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474450/. Lemme know when there's an idle window of a few mins
[16:29:13] akosiaris: sure
[16:29:20] I'm around until 6
[16:29:27] ORES, Scoring-platform-team, Edit-Review-Improvements-RC-Page, Growth-Team: Define a process for adding ORES filters to new wikis when ORES is enabled on those wikis - https://phabricator.wikimedia.org/T164331 (Halfak) So, I imagined that there would be some public notifications and some statemen...
[16:29:31] ORES, Scoring-platform-team, Analytics, Analytics-Kanban, and 4 others: Modify revision-score schema so that model probabilities won't conflict - https://phabricator.wikimedia.org/T197000 (Ottomata) We plan to deploy this Monday Nov 26th.
[16:29:50] ok, I 'll do so now then
[16:31:35] ORES, Scoring-platform-team (Current), User-Ladsgroup: Train/test wp10 model for fawiki - https://phabricator.wikimedia.org/T172629 (Halfak)
[16:34:02] I'm confused about something related to phabricator norms. I saw that awight changed the priority of https://phabricator.wikimedia.org/T209381 from UBN to Normal after the issue was addressed. Shouldn't we leave that task as UBN?
[16:34:12] It *was* very high priority.
[16:38:16] awight, I'm confused about something related to phabricator norms. I saw that you changed the priority of https://phabricator.wikimedia.org/T209381 from UBN to Normal after the issue was addressed. Shouldn't we leave that task as UBN?
[16:38:56] Also, hey!
Good morning :)
[16:39:23] labels.wmflabs.org is up so the issue is no longer user-facing
[16:39:32] I guess that's the reason
[16:39:34] Indeed. it seems like it is resolved.
[16:39:44] So why not leave the priority and just resolve?
[16:39:56] halfak: My thinking was that the issue was still open, but the remaining work was simply to verify the fix and tie up any loose ends. That corresponds to the initial phase of a task, where a normal prio bug might escalate into UBN due to new information. I could be wrong!
[16:40:14] I gotcha. I think that is fair.
[16:40:29] I was just imagining that it is sub-optimal from a metrics/reporting point of view.
[16:40:39] people take UBNs very seriously on their workboards, so I didn't want anyone to look at the task and freak out
[16:40:44] E.g. it would be hard to ask "How many UBN tasks did the team take on in the last year?"
[16:40:54] Is there a reporting use case like you're describing?
[16:41:10] If so, it should use "was this task ever UBN"
[16:41:15] Not a real use case. More imagined.
[16:41:38] Hmm. Not sure if we can search on that. Also, we might pick up a bunch of tasks that were mistakenly set to UBN briefly.
[16:41:40] :\
[16:41:48] saurabhbatra: hi! I saw you had pinged the other day...
[16:41:54] Really I want "Was it UBN for a substantial period of time?"
[16:42:03] awight: hello! :-)
[16:42:04] halfak: +1 debouncing
[16:42:05] there's gonna be a spike of 500s due to merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474450/ right about now
[16:42:06] Anyway. It's no big deal. Just thought I would check on that.
[16:42:16] Thanks for the heads up akosiaris
[16:42:24] halfak: I think it's a good question, maybe harej wants to comment
[16:42:44] awight: sorry for being awol for so long, I was at my parents' for Diwali (which is basically Indian Christmas)
[16:42:47] Oh! We have a sync meeting scheduled in 15 minutes but we decided to skip it on Friday.
[16:42:48] What I've usually seen is UBN for the outage and then lowering the priority once the immediate problem is solved
[16:43:01] harej, gotcha. So seems like a consistent norm.
[16:44:37] Basically ores will be down for a couple of minutes due to redis restart
[16:44:49] *cough* we need sentinel *cough*
[16:45:00] saurabhbatra: Sounds like a good time! I was worried I might be blocking you...
[16:45:15] :|
[16:45:18] Amir1: Was there any more conversation about redis sentinel vs. cluster?
[16:45:19] 2 minutes?
[16:45:27] Maybe we should do a maintenance notice?
[16:45:43] awight: nope I'm continuing work, will have updates by end of this week
[16:45:47] sentinel is a redis cluster
[16:45:54] a type of redis cluster
[16:46:24] Amir1: This is late for me to chime in, but is it possible to take the redis servers down one at a time, so the switchover is instantaneous?
[16:46:32] We would still lose some jobs, anyway...
[16:46:49] that already happened
[16:47:23] if everything goes down together, nothing can prevent from it going down
[16:47:35] so restarts needs to happen one by one
[16:47:55] Amir1: yeah sentinel is one of two alternatives for Redis HA, but with sentinel various stuff like failover has to happen manually, which redis cluster does automatically: https://fnordig.de/2015/06/01/redis-sentinel-and-redis-cluster/
[16:47:59] but depooling for a <10 seconds down time is a little bit weird
[16:48:24] sentinel handles failovers automatically
[16:48:38] it has an election period
[16:49:01] https://redis.io/topics/sentinel
[16:49:09] See example 2 and three
[16:49:25] you *also* can force a failover if you want to test
[16:49:26] it seems like the celery-ores-worker died
[16:49:31] okay, good to see that my doc is outdated
[16:49:42] What about this > You need at least three Sentinel instances for a robust deployment.
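Note on the quoted Sentinel guidance: per the Redis Sentinel documentation, a failover has to be authorized by a majority of all Sentinel processes, which is why a two-node setup cannot fail over once one node is lost. The arithmetic can be sketched as below. This is a simplified illustration only; the function names are hypothetical and nothing here is taken from the ORES configuration.

```python
def majority(sentinels: int) -> int:
    """Votes needed to authorize a failover: a strict majority of all Sentinels."""
    return sentinels // 2 + 1


def tolerated_failures(sentinels: int) -> int:
    """How many Sentinels can be lost while a majority is still reachable."""
    return sentinels - majority(sentinels)


def can_fail_over(sentinels: int, reachable: int, quorum: int) -> bool:
    """Simplified check: the reachable Sentinels must satisfy the configured
    quorum (to agree the master is down) and form a majority (to authorize
    the failover)."""
    return reachable >= quorum and reachable >= majority(sentinels)


if __name__ == "__main__":
    for n in (2, 3):
        print(n, "sentinels: majority", majority(n),
              "tolerated failures", tolerated_failures(n))
```

With two Sentinels the majority is two, so losing either one blocks failover; with three, one Sentinel can be lost. That matches the idea raised earlier in the log of adding a third node per cluster.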
[16:50:21] yup, I was suggesting to have the sentinel nodes in client (example 3) but akosiaris rightfully noted it'll make ores and redis tightly coupled
[16:50:31] we are thinking of adding a new node per cluster
[16:50:50] akosiaris: is it recovering
[16:51:05] not very much from where I am standing
[16:51:57] ok it's partial
[16:51:58] phew
[16:52:11] ores1002's celery-ores-worker failed.
[16:52:20] it's dead currently I 'll restart it
[16:52:21] 9 gbs more memory available in codfw nodes now. That's interesting
[16:52:31] codfw is in a more severe state
[16:52:35] but also receives way less traffic
[16:53:02] there only 3/9 hosts have the celery workers working fine
[16:53:19] fwiw https://redis.io/topics/admin#upgrading-or-restarting-a-redis-instance-without-downtime
[16:53:26] I 'll force a restart but I think we need to look at logs as to why celery died
[16:53:52] ^ if this is celery 3, that was a known bug
[16:54:31] everything is back up now
[16:56:09] Amir1: okay sentinel seems pretty similar to https://redis.io/topics/cluster-tutorial these days, and has probably gotten much more real-world usage
[16:56:51] https://redis.io/documentation > Redis Sentinel is the official high availability solution for Redis.
[16:56:54] I'm happy.
[16:57:09] Nobody would ever allow a typo like that on their corp site unless it were true ;-)
[16:59:03] awight: Also celery4 has official support of sentinel
[16:59:41] :100%: thanks
[17:00:12] 16:42 <+halfak> Oh! We have a sync meeting scheduled in 15 minutes but we decided to skip it on Friday.
[17:00:29] ^ right
[17:00:30] I just deleted the event
[17:00:42] (y)
[17:01:08] ^ used to be a thumbs up in MSN messenger
[17:01:30] Or was it ICQ
[17:02:42] don't we have a meeting right now?
[17:02:44] halfak: awight
[17:02:47] halfak:
[17:02:50] *harej
[17:03:01] Amir1, see scrollback
[17:03:08] <+halfak> Oh! We have a sync meeting scheduled in 15 minutes but we decided to skip it on Friday.
[17:03:27] PSA cycle complete :)
[17:03:36] haha
[17:03:37] okay
[17:04:08] Amir1, please update scoring_current etherpad
[17:04:17] https://etherpad.wikimedia.org/p/scoring_current_work
[17:05:23] I should be used to this in tech by now... people begin development on the "nice" solution, a few years later an ad-hoc thing comes along to do the same thing, but it works, so gets adoption, and the "nice" solution never happens
[17:09:52] halfak: done
[17:11:41] Thanks Amir1
[17:11:43] * halfak reads
[17:13:01] dare I ask what the not nice solution was?
[17:13:27] apergos: eh I'm just ruminating on redis cluster vs sentinel
[17:13:40] ah ha
[17:14:03] I guess it's also a story about a monolithic vs. more component-ey architecture
[17:14:27] akosiaris: Do you have a little bit of time to deploy this gradually: https://gerrit.wikimedia.org/r/c/operations/puppet/+/474694 ?
[17:14:27] awight, what task should I follow for the work you are mentoring re. paid editing?
[17:14:50] halfak: I think we're starting with https://phabricator.wikimedia.org/T120170
[17:15:02] Amir1: in a meeting currently. is tomorrow possible ?
[17:15:16] Great! I've got someone with somewhat casual interest who might turn into a collaborator. :)
[17:15:20] awight, ^
[17:15:27] Awesome!
[17:15:55] akosiaris: sure, let me know tomorrow
[17:16:55] ORES, Scoring-platform-team, Analytics: Wire ORES scoring events into Hadoop - https://phabricator.wikimedia.org/T209732 (fdans) p:Triage>High
[17:19:16] ORES, Scoring-platform-team, Analytics, Dumps-Generation, and 3 others: [Epic] Make ORES scores available in Hadoop and as a dump - https://phabricator.wikimedia.org/T209611 (fdans) p:Triage>Normal
[17:19:24] ORES, Scoring-platform-team, Analytics, Dumps-Generation, and 3 others: [Epic] Make ORES scores available in Hadoop and as a dump - https://phabricator.wikimedia.org/T209611 (fdans) p:Normal>Triage
[17:31:07] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (Ottomata) Since change-prop is responsible for emitting the revision-score event, we'll have to make sure that the...
[17:32:05] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (awight) >>! In T206037#4758553, @Halfak wrote: > +1. Note that this will affect ChangeProp/precaching too. We ne...
[17:33:18] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (awight) Downstream clients must not assume that `wp10` will be present. Other than that, no changes are required.
[17:49:02] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Merge articlequality and itemquality - https://phabricator.wikimedia.org/T206037 (awight)
[17:49:30] ORES, Scoring-platform-team (Current), Growth-Team, MediaWiki-extensions-PageCuration, and 2 others: Consolidate articlequality and itemquality models into a "model family" - https://phabricator.wikimedia.org/T206037 (awight)
[17:50:29] Scoring-platform-team, revscoring, Chinese-Sites, artificial-intelligence: Chinese language utilities - https://phabricator.wikimedia.org/T109366 (Halfak) I see. It looks like "和 (hé)" appears in the list of informal words here: https://resources.allsetlearning.com/chinese/grammar/Formal_and_inf...
[17:51:34] I cleared one inbox! Wooo
[17:51:47] Now onto the next one.
[18:04:01] Scoring-platform-team, revscoring, artificial-intelligence: FeatureScalar appears in features list rather than something meaningful - https://phabricator.wikimedia.org/T209869 (Halfak)
[18:07:25] I just started updating models for editquality so I'll take this opportunity to get some lunch
[18:12:59] Scoring-platform-team, revscoring, Chinese-Sites, artificial-intelligence: Chinese language utilities - https://phabricator.wikimedia.org/T109366 (Arthur2e5) 和(he2)by itself typically means “and”, so please treat it as a stop word as it is. Do not attempt any filtering on it. Many other formal/i...
[19:10:23] Scoring-platform-team, revscoring, Chinese-Sites, artificial-intelligence: Chinese language utilities - https://phabricator.wikimedia.org/T109366 (Halfak) I see. So if I remove those three from the list, then it is otherwise representative of informal language?
[19:13:44] I'm afk for dinner, will be back soon
[20:28:30] (PS1) Awight: Fix validation for an empty list of endorsements [extensions/JADE] - https://gerrit.wikimedia.org/r/474774
[20:57:48] JADE, Scoring-platform-team (Current), DBA, Operations, TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight)
[21:00:57] JADE, Scoring-platform-team (Current), Design, Patch-For-Review: Come up with view mode for JADE pages - https://phabricator.wikimedia.org/T208819 (awight)
[21:16:16] JADE, Scoring-platform-team (Current), DBA, Operations, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (Marostegui) >>! In T196547#4748654, @awight wrote: > This was addressed for now, by an agreement between our team and SRE to not install JA...
[21:16:18] JADE, Scoring-platform-team, I18n: Content quality scale translatable strings might not work as implemented - https://phabricator.wikimedia.org/T209884 (awight)
[21:19:28] JADE, Scoring-platform-team (Current), DBA, Operations, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (awight)
[21:20:22] JADE, Scoring-platform-team (Current), DBA, Operations, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (awight) >>! In T196547#4759786, @Marostegui wrote: > There are some other big wikis (commons) where this is also a concern and some other a...
[21:21:53] I'm going to take the rest of the day off, sorry for the short notice!
[21:27:37] Godspeed for whatever surprise made you need to run.
[21:27:45] awight, ^
[21:28:13] Yesterday I learned that school is out for the week...
[21:32:55] I'm back
[21:33:02] Will work a little and then call it a day
[21:40:27] HEY
[21:40:42] Woops.
Nevermind :)
[21:43:21] * Amir1 giggles :P
[22:26:01] Scoring-platform-team, Wikilabels, articlequality-modeling, artificial-intelligence: Build article quality model for Galician Wikipedia - https://phabricator.wikimedia.org/T201146 (Elisardojm) Ah, ok! Thanks!
[23:20:31] OK I think that's good for today. I got through almost all of the email. Hopefully I can have the rest of it done in time for actual productive work tomorrow :)
[23:20:35] o/
[23:34:35] I'm done for the day too
[23:34:36] o/