[05:54:03] 10Analytics, 10MediaWiki-REST-API, 10Platform Team Sprints Board (Sprint 5), 10Platform Team Workboards (Green), 10Story: System administrator reviews API usage by client - https://phabricator.wikimedia.org/T251812 (10Aklapper) a:05eprodromou→03None [06:11:03] 10Analytics, 10SRE: wmf-auto-restart.py + lsof + /mnt/hdfs may need to be tuned - https://phabricator.wikimedia.org/T278371 (10MoritzMuehlenhoff) >>! In T278371#6948029, @elukey wrote: > Yep most of the times it works fine, but when the fuse process gets into its weird state then everything trying to access /m... [06:26:39] Good morning [06:29:00] bonjour! [06:53:31] Amir1: o/ if you have a moment, do you know anything about https://phabricator.wikimedia.org/T278665 ? [06:53:39] or who I can ping [07:17:18] 10Analytics, 10WMDE-Analytics-Engineering: wmde-toolkit-analyzer-build.service fails on stat1007 - https://phabricator.wikimedia.org/T278665 (10elukey) Used `jdb` to get more info: ` main[1] print this.counters this.counters = "{property.statements.avg=0.0, item.statements.avg=0.0}" ` The code is https://gi... [07:17:29] TIL jdb, nice [07:41:11] even if I cannot attach to a running jvm for some reason [08:07:00] joal: so for our dear capacity scheduler, should we move to something like production/essential ? [08:07:08] it would also be easier to deploy [08:07:18] elukey: in meeting now, will answer later [08:07:45] ah yes yes sorry! 
[08:07:55] elukey: from yesterday's meeting, flat queues (instead of hierarchical) looks the way to go [08:10:58] today is mostly meetings :S [08:11:03] I'd have preferred a more structured way for ACLs, it seems a waste to not use it [08:11:37] elukey: we can still use ACLs, giving prod users admin capabilities to prod queues only [08:11:44] (the various prod users) [08:13:00] I know I know [08:18:37] elukey: The idea of not having team queues is actually good - I'm sorry for the overwork of having to change everything :( [08:19:15] joal: I am very angry with you Joseph [08:19:34] ahahhaha please don't say that, it is a big change and it was a good learning experience [08:19:43] :) [08:19:48] I mean we now know how to change queues etc.. easily [08:19:56] true [08:35:18] elukey: morning, hmm, it can be because of migration to the systemd timer but I can't understand how [08:35:39] unless there's a permission issue and the user has been changed. [08:36:26] Amir1: not sure if it ever worked, I had to manually run git-lfs to pull the jar down [08:36:40] now it seems erroring due to an NPE [08:37:50] NPE? [08:38:37] Null pointer [08:39:22] I adde some info the the task, but I have no idea if the code worked before.. I mean, are there any output/metrics/etc.. that WMDE relies on that we can check? [08:40:03] So this is the builder of the jar file, I know I built it manually before [08:40:14] maybe it never worked [08:40:41] it would make sense, the timer is way more reliable in getting failures [08:40:50] awight: o/ do you have a min? [08:43:30] ValueError: Invalid metric name "MediaWiki.TemplateWizard.save.byEditCount.1000+ edits.byWiki.afwiki" [08:46:30] 10Analytics-Radar, 10Patch-For-Review, 10WMDE-TechWish-Sprint-2021-03-31: Reportupdater output can be corrupted by hive logging - https://phabricator.wikimedia.org/T275757 (10elukey) Applied the patch with some tweaks, it is definitely a cleaner a more modern approach, thanks! I see some errors for the vari... 
[09:37:58] 10Analytics-Radar, 10Patch-For-Review, 10WMDE-TechWish-Sprint-2021-03-31: Reportupdater output can be corrupted by hive logging - https://phabricator.wikimedia.org/T275757 (10awight) >>! In T275757#6963571, @elukey wrote: > Applied the patch with some tweaks, it is definitely a cleaner a more modern approach... [09:42:48] 10Analytics-Radar, 10Patch-For-Review, 10WMDE-TechWish-Sprint-2021-03-31: Reportupdater output can be corrupted by hive logging - https://phabricator.wikimedia.org/T275757 (10elukey) ` elukey@an-launcher1002:~$ ls -l /tmp/parquet-0.log -rw-r--r-- 1 analytics analytics 7521 Apr 1 09:09 /tmp/parquet-0.log `... [09:54:49] 10Analytics-Radar, 10WMDE-TechWish: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (10awight) [09:59:27] * elukey bbiab! [10:02:04] 10Analytics-Radar, 10Patch-For-Review, 10WMDE-TechWish-Sprint-2021-03-31: Reportupdater output can be corrupted by hive logging - https://phabricator.wikimedia.org/T275757 (10awight) >>! In T275757#6963695, @elukey wrote: > This is a very nice confirmation about the new logging style, it seems working :) He... [10:21:47] 10Analytics-Radar, 10WMDE-TechWish: Broken reportupdater queries: edit count bucket label contains illegal characters - https://phabricator.wikimedia.org/T279046 (10awight) I'll apply this transformation at the output, ` select replace(replace('100-999 edits', '+', ' or more'), ' ', '_'); select replace(replac... 
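The label transformation awight describes on T279046 (replace `+` with ` or more`, then spaces with `_`) can be sketched in Python as below. This is only an illustration of the idea, not the reportupdater code: the function names and the allowed-character pattern are assumptions, since the log does not show the validation rule that raised the `ValueError`.

```python
import re

# Hypothetical pattern for a "safe" metric name component; the real
# validation rule is not shown in the log, so this is an assumption.
_SAFE_COMPONENT = re.compile(r"^[A-Za-z0-9_-]+$")


def sanitize_bucket_label(label: str) -> str:
    """Mirror replace(replace(label, '+', ' or more'), ' ', '_')."""
    return label.replace("+", " or more").replace(" ", "_")


def is_safe_component(component: str) -> bool:
    """Return True if the component only uses the assumed safe characters."""
    return bool(_SAFE_COMPONENT.match(component))


if __name__ == "__main__":
    for raw in ("100-999 edits", "1000+ edits"):
        print(raw, "->", sanitize_bucket_label(raw))
```

With this sketch the bucket label from the failing metric name, `1000+ edits`, no longer contains a `+` or a space after sanitizing.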
[10:27:22] (03PS1) 10Awight: Escape edit count bucket for metrics tag name [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676297 (https://phabricator.wikimedia.org/T279046) [10:38:04] (03PS1) 10Awight: Validate the native "hive" report type [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676299 (https://phabricator.wikimedia.org/T193169) [10:38:33] (03PS2) 10Awight: Validate the native "hive" report type [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676299 (https://phabricator.wikimedia.org/T193169) [10:44:00] (03PS2) 10Awight: Escape edit count bucket for metrics tag name [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676297 (https://phabricator.wikimedia.org/T279046) [10:54:56] just tried the wmfdata's ship_python_env=True, really nice! [10:55:38] elukey: Sorry I missed your ping - I've pushed a patch for the error you saw. [10:56:03] awight: hi! yep saw it! It LGTM, but maybe better to wait for mforns ? [10:56:26] just looking at the AQS deploy docs - is this doing regularly https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS#NPM_vulnerabilities? There's a few things that would benefit from an update in the service [10:56:39] elukey: +1 sounds right [11:00:24] hnowlan: I don't believe that we do it (I asked in the past but I didn't get how strictly we enforce npm audit) [11:04:42] (03CR) 10Hnowlan: [C: 03+2] package: bump restbase-mod-table-cassandra [analytics/aqs] - 10https://gerrit.wikimedia.org/r/675523 (https://phabricator.wikimedia.org/T278699) (owner: 10Hnowlan) [11:08:19] elukey: fair enough, I think "not very" is the answer. There's one fix that I think we should probably do but I'll put it in another CR [11:13:11] hnowlan: thanks a lot for doing it, really appreciated :) [11:24:12] np! [11:26:43] * elukey lunch! 
[11:34:58] (03CR) 10jerkins-bot: [V: 04-1] package: bump restbase-mod-table-cassandra [analytics/aqs] - 10https://gerrit.wikimedia.org/r/675523 (https://phabricator.wikimedia.org/T278699) (owner: 10Hnowlan) [12:10:14] 10Analytics-Radar, 10Dumps-Generation: Filename convention is not easy to follow for dumps using a `precombine` step - https://phabricator.wikimedia.org/T279055 (10JAllemandou) [12:17:39] (03PS1) 10Awight: Validate `funnel` deprecation [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/676333 (https://phabricator.wikimedia.org/T193170) [12:19:22] mforns: Hi, next time you find yourself reviewing code, this is the new priority for me: https://gerrit.wikimedia.org/r/c/analytics/reportupdater-queries/+/676297 - but in the big picture, it's exciting to be wrecked on the next outcropping of rocks! [12:22:55] (03PS1) 10Andrew-WMDE: ReferencePreviewsPopups: Track anonymous enables/disables [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676334 (https://phabricator.wikimedia.org/T277641) [12:33:30] (03CR) 10Awight: [C: 03+2] "Makes sense!" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676334 (https://phabricator.wikimedia.org/T277641) (owner: 10Andrew-WMDE) [12:33:31] 10Analytics: Produce a list of wiki projects ranked by number of eligible voters in Board elections - https://phabricator.wikimedia.org/T278815 (10JAllemandou) Thank you @KCVelaga for the suggestion and link to code :) I assume we should base our rules on the following description: https://github.com/Pathoschild... 
[12:34:18] (03Merged) 10jenkins-bot: ReferencePreviewsPopups: Track anonymous enables/disables [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676334 (https://phabricator.wikimedia.org/T277641) (owner: 10Andrew-WMDE) [12:48:07] 10Analytics, 10Data-Services, 10Machine-Learning-Team, 10ORES, and 2 others: Generate dump of scored-revisions from 2018-2020 for English Wikipedia - https://phabricator.wikimedia.org/T277609 (10JAllemandou) Hi @Suriname0 About 1. I prefer to defer to one of our SREs (@Ottomata, @razzi - Any idea?) On 2.... [13:00:48] 10Analytics, 10Data-Services, 10Machine-Learning-Team, 10ORES, and 2 others: Generate dump of scored-revisions from 2018-2020 for English Wikipedia - https://phabricator.wikimedia.org/T277609 (10Ottomata) > I've tried on a few different machines/networks over the last day, but analytics.wikimedia.org termi... [13:03:14] 10Analytics: Duplicate wikitext entries for a bunch of wikis in 2021-02 snapshot - https://phabricator.wikimedia.org/T278551 (10Isaac) Thanks @JAllemandou for tracking this down and fixing it! [13:06:22] 10Analytics: Produce a list of wiki projects ranked by number of eligible voters in Board elections - https://phabricator.wikimedia.org/T278815 (10Qgil) I have what I think are good news. :) While it is exciting to get more accurate results and I have been the first one proposing to fine tune the query... What... [13:10:12] 10Analytics: Review the usage of dns_canonicalize=false for Kerberos - https://phabricator.wikimedia.org/T278353 (10elukey) I had a chat with Moritz and this is what I am proposing to do: * leave `dns_canonicalize_hostname=false` on analytics clients and hadoop workers, since they will need it for all use cases... 
[13:18:26] razzi: Hi :) [13:18:50] razzi: I'm monitoring sqoop, and it looks like we pay some price on using the new setup in perf [13:18:58] 10Analytics, 10Patch-For-Review: Review the usage of dns_canonicalize=false for Kerberos - https://phabricator.wikimedia.org/T278353 (10elukey) Remaining steps: 1) Check that the above patch works fine in Hadoop test (will keep monitoring jobs etc.. for a bit) 2) Remove the dns_canon setting from the roles (p... [13:19:55] joal: can you be more specific? Slower sqoop times? [13:20:07] elukey: hi :) [13:20:15] elukey: I'm trying to get more precise data [13:20:26] elukey: I'll be there in two minutes [13:26:39] elukey: confirmed - in particular, wikidatawiki.revision has taken a much longer time this time than the previous [13:26:56] joal: how long? :) [13:28:00] Almost 5h this time, 1.5h last time [13:28:05] elukey: --^ [13:28:40] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=labsdb1012&var-datasource=eqiad%20prometheus%2Fops&refresh=5m&from=1614464462390&to=1614760060688 [13:28:44] this is 1012 [13:29:33] there is a bit more load on 1021 atm [13:29:56] but maybe the slowdown is due to memory assigned to various mariadb replics [13:29:58] yeah I was looking at that [13:30:12] this was one of the things that we should have tested (I asked about it :) [13:30:13] (load, not various replicas) [13:30:41] * joal remebers having agreed with elukey :) [13:31:24] so wikidata is in s8 [13:32:08] and this is the memory allocated for mariadb buffer pools [13:32:10] s1: 70G [13:32:10] s2: 40G [13:32:10] s3: 40G [13:32:10] s4: 70G [13:32:12] s5: 40G [13:32:15] s6: 30G [13:32:17] s7: 50G [13:32:20] s8: 70G [13:33:02] right - this is a lot smaller than being able to get the whole memory accessed for a big table [13:33:06] anyway [13:33:12] at least it works [13:35:18] the instance should be https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=clouddb1021&var-port=13318&from=now-12h&to=now [13:35:41] 
elukey: I also have questions around MySQL setup [13:36:32] joal: which ones? [13:36:44] elukey: I need to drop for kids in minutes, and I have a lot of meetings tonight - Will try to investigate in the meeting meatime [13:36:50] ah sure! [13:36:51] elukey: mostly about indices [13:37:08] ah then Brooke or Manuel probably are the best ones to answer, they have set up the whole thing [13:37:49] elukey: for instance, sqooping for commons.content started long after the job said it had started [13:38:14] elukey: this can mean that getting the min and max values for the IDs is very long, this would mean lack of indices [13:38:27] anyhow - more things to investigate [13:50:46] 10Analytics-Clusters: Upgrade the rest of the Hadoop test cluster to Buster - https://phabricator.wikimedia.org/T278422 (10elukey) a:05razzi→03elukey [13:51:16] 10Analytics-Clusters: Upgrade the rest of the Hadoop test cluster to Buster - https://phabricator.wikimedia.org/T278422 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-test-worker1001.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reima... [13:56:42] (03CR) 10Ottomata: "This change is ready for review." (031 comment) [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/665130 (https://phabricator.wikimedia.org/T272390) (owner: 10Razzi) [14:02:23] 10Analytics-Clusters: Upgrade the rest of the Hadoop test cluster to Buster - https://phabricator.wikimedia.org/T278422 (10elukey) The workers don't have their journalnodes dir in a partition (my bad) so I'll will try to add it asap, and I'll update the reuse recipes. Then an-test-coord1001 will be next, I want... 
[14:07:59] (03CR) 10Hnowlan: package: bump restbase-mod-table-cassandra [analytics/aqs] - 10https://gerrit.wikimedia.org/r/675523 (https://phabricator.wikimedia.org/T278699) (owner: 10Hnowlan) [14:13:22] (03PS2) 10Hnowlan: package: bump restbase-mod-table-cassandra [analytics/aqs] - 10https://gerrit.wikimedia.org/r/675523 (https://phabricator.wikimedia.org/T278699) [14:16:06] (03CR) 10Hnowlan: [C: 03+2] package: bump restbase-mod-table-cassandra [analytics/aqs] - 10https://gerrit.wikimedia.org/r/675523 (https://phabricator.wikimedia.org/T278699) (owner: 10Hnowlan) [14:16:33] 10Analytics, 10Data-Services, 10Machine-Learning-Team, 10ORES, and 2 others: Generate dump of scored-revisions from 2018-2020 for English Wikipedia - https://phabricator.wikimedia.org/T277609 (10Suriname0) @JAllemandou thanks for clarifying! I will go without the 2018 data :) @Ottomata It seems to be flat... [14:17:43] (03Merged) 10jenkins-bot: package: bump restbase-mod-table-cassandra [analytics/aqs] - 10https://gerrit.wikimedia.org/r/675523 (https://phabricator.wikimedia.org/T278699) (owner: 10Hnowlan) [14:23:40] (03PS3) 10Ottomata: Include RefineFailuresChecker functionality into RefineMonitor and fix bug in Refine.Config [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676131 [14:24:07] elukey: FYI I'm rolling the RefineFailuresChecker functionality into RefineMonitor [14:24:13] so we only have to schedule one monitoring job [14:24:14] (03PS1) 10Hnowlan: Update aqs to 4e95573 [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/676379 [14:25:15] ottomata: ah didn't know it, nice! 
[14:25:36] (03CR) 10Hnowlan: "Will be testing this in deployment-prep first" [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/676379 (owner: 10Hnowlan) [14:25:38] i was beginning to do the puppet changes to account for the sanitize refactoring, and realized that this will make that easier [14:26:05] ack [14:30:56] 10Analytics, 10Analytics-EventLogging, 10Better Use Of Data, 10Event-Platform, and 4 others: KaiOS / Inuka Event Platform client - https://phabricator.wikimedia.org/T273219 (10SBisson) [14:31:50] 10Analytics, 10Event-Platform, 10Inuka-Team (Kanban): KaiOSAppFirstRun Event Platform Migration - https://phabricator.wikimedia.org/T267346 (10SBisson) @nshahquinn-wmf ping? ;) [14:35:58] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Patch-For-Review: Sanitize and ingest all event tables into the event_sanitized database - https://phabricator.wikimedia.org/T273789 (10Ottomata) @joal https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/676131 fixes a bug in the validate() stuf... [14:38:51] 10Analytics-Clusters: Upgrade the rest of the Hadoop test cluster to Buster - https://phabricator.wikimedia.org/T278422 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-test-worker1001.eqiad.wmnet'] ` and were **ALL** successful. [14:39:55] (03PS4) 10Ottomata: Include RefineFailuresChecker functionality into RefineMonitor and fix bug in Refine.Config [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676131 [14:43:08] (03CR) 10jerkins-bot: [V: 04-1] Include RefineFailuresChecker functionality into RefineMonitor and fix bug in Refine.Config [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676131 (owner: 10Ottomata) [14:44:44] joal: hi! I remember you (maybe I helped?) did a study on how much data would be lost by applying k-anonymity to pageview hourly, was there a spreadsheet? I can not find it.. 
[14:44:53] (03CR) 10Ottomata: "recheck" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676131 (owner: 10Ottomata) [14:52:53] (03PS1) 10Jason Linehan: [WIP] Metrics Platform context attribute schema fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) [15:00:59] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Metrics Platform context attribute schema fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan) [15:05:09] hey mforns, fdans, no meeting today? [15:07:59] dsaez weird https://usercontent.irccloud-cdn.com/file/thrXDHx8/Screen%20Shot%202021-04-01%20at%2010.07.41%20AM.png [15:08:23] yes, I have the same msg [15:08:38] dsaez: I think the idea was to pause the meetings until sukhbir is back from paternity leave? [15:09:08] yep, I was not sure, but makes sense [15:24:29] hnowlan: o/ [15:24:49] I saw that you mentioned about testing aqs in deployment-prep, the cluster in there got removed a while ago [15:25:13] so we have now one in our analytics horizon namespace, and to deploy we use git (not scap) [15:25:35] there is not a lot of data in the cluster (maybe zero) so we'll need to add something first [15:29:55] a-team: quick reminder that there's no standup today in favor of the monthly staff meeting [15:35:46] (03PS1) 10Hnowlan: Add makefile and dockerfile for local tests [analytics/aqs] - 10https://gerrit.wikimedia.org/r/676402 [15:47:27] hey dsaez sorry was in a meeting, no.. I cancelled meetings until sukhbir is back, or we designate someone else from traffic to attend [15:47:56] I left a message when cancelling the events, but I should have been more explicit [15:48:19] mforns all good, I've just saw my calendar and wanted to confirm with you [15:48:24] k [15:57:31] (03CR) 10Ottomata: "Thanks! Will follow up with this after we release 1.0.4." 
(031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676081 (owner: 10DCausse) [15:59:32] 10Analytics, 10Data-Services, 10Machine-Learning-Team, 10ORES, and 2 others: Generate dump of scored-revisions from 2018-2020 for English Wikipedia - https://phabricator.wikimedia.org/T277609 (10Suriname0) I was able to get the 2019 data downloaded on a Toolforge server (connecting via dev.toolforge.org),... [15:59:35] elukey: ahh cool [16:00:21] tbh in that case I was thinking that we could even just deploy the new version to the new hosts alone given that they're not pooled but they do have data [16:00:44] actually no, never mind, that fails to test the actual think I want to test which is backwards compatibility [16:08:04] 10Analytics-Radar, 10SRE, 10ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Cmjohnson) @elukey I have not forgotten about this, A7 is a rack for the possible move but we are already maxing out our power utilization in that rack and addi... [16:18:25] 10Analytics-Radar, 10SRE, 10ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10wiki_willy) Hi @Cmjohnson - there should some power freed up, after some mw servers are decom'd for the T273915 refresh. There's going to 7x servers coming out... [16:20:07] 10Analytics-Radar, 10SRE, 10ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Cmjohnson) @wiki_willy That will work! 
Thanks [16:50:15] !log rebalance kafka partitions for webrequest_text partitions 7 and 8 [16:50:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:53:58] hey razzi, morning :) [16:54:10] I created some tasks the other day for the remaining hadoop buster upgrades [16:54:36] there is one for flerovium/furud, I think it is a good start to learn about reimages that preserve partitions [16:54:49] do you have time during the next days to check it? [16:55:02] Hey there elukey! I'm here and in the staff meeting :) [16:55:15] me too :) [16:56:22] also https://phabricator.wikimedia.org/T276239 seems moving, so you'll have soon-ish 6 hadoop workers nodes to reimage/initialize/add-to-the-cluster [16:56:27] (I'll assist of course!) [16:57:39] Yeah, hopefully I can wrap up superset today and then start to look into upgrades [16:59:40] razzi: ah btw about the procedure - I meant to say that instead of curl | bash you could just list the steps needed that are in the scritp, like echo "some-apt-thing" >> /etc/.." + apt-get install + etc.. It is fine, the important bit is that we don't execute blindly some code [16:59:57] (so we can skip the apt import packages part) [17:00:11] or follow what Andrew said about using docker images with npm on it etc.. [17:00:21] ok whew, I was pretty lost with reprepro [17:01:08] I'd like to figure it out someday, and update the docs, but I told product-analytics I'd get superset 1.0 out yesterday!!! [17:01:33] razzi: yes but today is April first, so we are good! :P [17:02:23] it is ok to go for the deployment, just amend the new scripts/procedure so we'll avoid to forget [17:03:10] the other thing - we currently hack the gamma role to allow it to use sql-lab, have you those steps included in the upgrade procedure after you run the superset init step? 
(IIRC it clears out all these settings) [17:03:19] otherwise people will not be able to use sqllab [17:03:25] hm no I was unaware [17:04:16] ok so try to test it with a non admin user on staging, to see what happens [17:04:40] I think that superset 1.0 changed the perms names/settings for this, so possibly they have a different name [17:05:14] this contains all the context https://phabricator.wikimedia.org/T249923 [17:05:38] https://github.com/apache/superset/issues/9543#issuecomment-697334398 [17:05:49] sorry not Gamma, Alpha [17:06:04] this is something that we'll likely need to improve [17:07:34] (03CR) 10Ottomata: Image recommendations table for android (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/668244 (owner: 10Sharvaniharan) [17:11:45] (03CR) 10Joal: "Asking for test, code looks good :)" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676131 (owner: 10Ottomata) [17:11:50] (afk for a bit, will read later) [17:13:50] (03CR) 10Joal: "Thanks a lot @hnolan :)" [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/676379 (owner: 10Hnowlan) [17:14:57] 10Analytics, 10Data-Services, 10Machine-Learning-Team, 10ORES, and 2 others: Generate dump of scored-revisions from 2018-2020 for English Wikipedia - https://phabricator.wikimedia.org/T277609 (10Halfak) I wonder if this is related to: {T104004} [17:15:37] joal: unit-test for validate() [17:15:39] (03CR) 10Joal: "One question - This is great - thanks again!" (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/676402 (owner: 10Hnowlan) [17:15:40] is that what you are asking for? [17:16:17] ottomata: rather a unit-test for the class-init at large, that would have failed validate [17:16:32] for Refine.Config? 
[17:16:34] ottomata: I haven't even thought about how complicated that would be [17:16:42] for Refine.Config pretty easy I think [17:16:44] can add that [17:16:50] for Refine on the whole a bit harder [17:17:29] ottomata: a test that would have found the bug you fied is good :) I think it Refine.Config [17:17:31] ojk [17:17:50] pfff- big fat fingers - sorry for typos [17:22:42] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:25:35] 10Analytics, 10Better Use Of Data, 10Gerrit-Privilege-Requests, 10Product-Analytics, 10Product-Data-Infrastructure: Create or identify an appropriate Gerrit group for +2 rights on schemas/event/secondary - https://phabricator.wikimedia.org/T279089 (10Mholloway) [17:25:43] 10Analytics: Duplicate wikitext entries for a bunch of wikis in 2021-02 snapshot - https://phabricator.wikimedia.org/T278551 (10JAllemandou) I confirm data is fixed for `snapshot=2021-02` - Let's keep this open to remember monitoring next snapshot. 
[17:26:11] 10Analytics, 10Analytics-Kanban: Duplicate wikitext entries for a bunch of wikis in 2021-02 snapshot - https://phabricator.wikimedia.org/T278551 (10JAllemandou) [17:34:00] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:37:25] (03PS5) 10Ottomata: Include RefineFailuresChecker functionality into RefineMonitor and fix bug in Refine.Config [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676131 [17:37:48] (03CR) 10Ottomata: Include RefineFailuresChecker functionality into RefineMonitor and fix bug in Refine.Config (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676131 (owner: 10Ottomata) [17:39:26] joal: done ^^ [17:41:09] ottomata: shall I merge? [17:53:09] * elukey afk! :) [17:53:24] ottomata: before leaving, if you have ideas about the Yarn queues etc.. lemme know! [17:53:36] otherwise I'll just map 1:1 what we have, removing maybe some queues that we don't use [17:53:49] joal: yes please! [17:54:01] oh elukey want to jump in hangout i'm syncing with razzi right now about superset etc [17:54:08] we can looka t queues real quick? [17:54:16] ohyou are afk nm [17:55:11] (03CR) 10Joal: [C: 03+2] "Thanks @ottomata :)" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676131 (owner: 10Ottomata) [17:55:24] ottomata: sure I can join for some mins [17:57:55] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Review the Yarn Capacity scheduler and see if we can move to it - https://phabricator.wikimedia.org/T277062 (10Ottomata) Ok how about: - default - fifo - production - essential With GPU versions of all? 
[18:02:37] (03Merged) 10jenkins-bot: Include RefineFailuresChecker functionality into RefineMonitor and fix bug in Refine.Config [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/676131 (owner: 10Ottomata) [18:04:36] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Review the Yarn Capacity scheduler and see if we can move to it - https://phabricator.wikimedia.org/T277062 (10JAllemandou) We could do: * fifo - 5% * default - 35% * production - 50% * essential - 10% agreed for GPU for fifo only, with even a li... [18:07:02] (03PS2) 10Jason Linehan: [WIP] Metrics Platform context attribute schema fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) [18:08:37] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Metrics Platform context attribute schema fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan) [18:09:10] (03PS3) 10Jason Linehan: [WIP] Metrics Platform context attribute schema fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) [18:09:54] elukey: is there something you want to hand off for ops week now? otherwise we can do tomorrow [18:10:26] mforns: hey! Wasn't Joseph on-call? [18:10:36] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Metrics Platform context attribute schema fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: 10Jason Linehan) [18:10:40] oh elukey! sorry [18:10:49] joal: wanna hand anything off from ops week? 
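The flat queue split joal proposes on T277062 (fifo 5%, default 35%, production 50%, essential 10%) can be sanity-checked with a small Python sketch like the one below. It only renders the percentages into `yarn.scheduler.capacity.*` property names; the helper name is hypothetical, the GPU variants mentioned in the task are not modelled, and nothing here reflects the actual puppet change.

```python
# Percentages quoted in the T277062 comment; everything else is illustrative.
PROPOSED_CAPACITIES = {
    "fifo": 5,
    "default": 35,
    "production": 50,
    "essential": 10,
}


def capacity_scheduler_properties(capacities, parent="root"):
    """Render flat sibling queues as Capacity Scheduler properties.

    Sibling queue capacities are expected to add up to 100, so a split
    that does not is rejected before anything is rendered.
    """
    total = sum(capacities.values())
    if total != 100:
        raise ValueError(f"queue capacities sum to {total}, expected 100")
    props = {f"yarn.scheduler.capacity.{parent}.queues": ",".join(capacities)}
    for name, percent in capacities.items():
        props[f"yarn.scheduler.capacity.{parent}.{name}.capacity"] = str(percent)
    return props


if __name__ == "__main__":
    for key, value in capacity_scheduler_properties(PROPOSED_CAPACITIES).items():
        print(f"{key}={value}")
```

A check like this catches a split that drifts away from 100% when a queue is added or removed.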
[18:10:56] perfect :D [18:10:57] o/ [18:11:02] * elukey afk [18:15:06] ok I confirm I have a culprit for the commonswiki.content table taking long time [18:15:34] And actually, from other as well it seems [18:22:27] (03PS4) 10Jason Linehan: [WIP] Metrics Platform context attribute schema fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) [18:26:17] (03PS5) 10Jason Linehan: [WIP] Metrics Platform context attribute schema fragment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) [18:52:05] (03PS15) 10Razzi: Upgrade superset to 1.0.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/665130 (https://phabricator.wikimedia.org/T272390) [18:59:14] Yes bstorm :) [18:59:15] Hi [18:59:25] 👋🏻 [18:59:28] bstorm: o/ [18:59:30] What's up? [18:59:36] bstorm: my quesion is on cloudb1021 [19:00:02] 👂🏻 [19:00:10] With the new install, our process getting data is a lot slower [19:00:18] And I might jave a lead [19:00:25] Ah ok [19:00:32] I wanted to confirm [19:01:07] There hasn't been any real tuning yet of the buffers, I will say, and that is the only host that has all 8 sections on it. [19:01:15] What's the theory so far? [19:01:20] or lead at least [19:01:30] My chain of thought: our process happens in 2 stages, first get mion and max of ids, then get the data in splits by splitting between min and max [19:01:53] and even just the first phase seems very long [19:02:13] So I checked on the host, and there no indices on tables from what I can see [19:02:25] There were when I checked [19:02:32] meh? [19:02:48] You may be checking the _p databases [19:02:55] correct! [19:02:57] If so, there are no tables at all. 
They are just views [19:03:02] right [19:03:05] The indexes are all on the underlying tables [19:03:12] since the views are effectively queries [19:03:15] that was my wonder [19:03:20] ok it makes sense [19:03:22] That's no change from the previous setup [19:03:28] I think I don't have access to the underlying tables [19:03:47] Root has access, but like tool accounts don't [19:03:51] ok - the problem must come from tuning then [19:04:11] The underlying tables are all on the primary database...so like enwiki not on enwiki_p [19:04:20] enwiki_p will have no tables at all [19:04:29] thanks a lot for infirming my idea bstorm, this helps [19:04:33] but all views are queries on enwiki [19:04:38] Np [19:05:07] we're gonna let the current jobs finish, then we'll come to ou again I think and talk about tuning [19:05:08] I believe all databases are set up with pretty small buffer values (*checks what that buffer is called*) [19:05:10] bstorm: -^ [19:05:30] bstorm: thanks for the fast answers [19:07:57] np! It may be possible to adjust your chunk sizes to the tuning (which should be hosts hiera for it) or to shuffle the tuning according to section (which marostegui would probably be better help with) [19:09:13] ack bstorm - we'll deep dive when we'll be able to test - creating a ticket [19:09:35] Ok cool. 
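The two-stage import joal describes to bstorm (first fetch the min and max of the ID column, then pull the data in splits between them) can be sketched as follows. This is a simplified illustration of the idea, not Sqoop's actual splitting code; the function names and the SQL text are assumptions. It also shows why the first stage depends on an index: the boundary query has to find the extremes of the ID column before any split can start.

```python
def boundary_query(table: str, id_column: str) -> str:
    """Stage 1 (assumed shape): one query for the ID bounds of the table."""
    return f"SELECT MIN({id_column}), MAX({id_column}) FROM {table}"


def split_ranges(min_id: int, max_id: int, num_splits: int):
    """Stage 2: cut [min_id, max_id] into contiguous, near-equal ID ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")
    total = max_id - min_id + 1
    num_splits = min(num_splits, total)  # never create empty ranges
    base, extra = divmod(total, num_splits)
    ranges = []
    start = min_id
    for index in range(num_splits):
        # The first `extra` ranges take one additional ID each.
        size = base + (1 if index < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    print(boundary_query("revision", "rev_id"))
    for low, high in split_ranges(1, 10, 3):
        print(f"WHERE rev_id BETWEEN {low} AND {high}")
```

Each range would then be fetched by its own worker, so the chunk size per worker is one of the knobs bstorm suggests adjusting to the buffer tuning.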
[19:13:58] Analytics, Cloud-Services, Data-Persistence (Consultation): Sqoop on multi-instance clouddb1021 is very slow for some tables - https://phabricator.wikimedia.org/T279095 (JAllemandou)
[19:14:43] ok, gone for now - thanks again bstorm (and sorry for the multi ping :)
[19:16:26] 👋🏻
[19:23:59] (CR) Jason Linehan: [WIP] Metrics Platform context attribute schema fragment (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[19:50:04] Alright, going to start the superset 1.0 deployment now
[19:50:55] (CR) Razzi: [V: +2 C: +2] Upgrade superset to 1.0.1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/665130 (https://phabricator.wikimedia.org/T272390) (owner: Razzi)
[19:52:13] (CR) Razzi: [V: +2 C: +2] Upgrade superset to 1.0.1 (1 comment) [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/665130 (https://phabricator.wikimedia.org/T272390) (owner: Razzi)
[19:54:04] !log dump superset production to an-coord1001.eqiad.wmnet:/home/razzi/superset_production_1617306805.sql just in case
[19:54:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:55:38] an-worker1080 has been down for a while. I told dcops about it
[20:01:04] !log sudo chown -R analytics_deploy:analytics_deploy /srv/deployment/analytics/superset/venv since it's owned by root and needs to be removed upon deployment
[20:01:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:14:53] !log manually run bash /srv/deployment/analytics/superset/deploy/create_virtualenv.sh as analytics_deploy on an-tool1010, since somehow it didn't run with scap
[20:14:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:15:39] I'm running into a migration issue with superset, which I think will be fixed if I restore from the database dump I took 20 minutes ago, so I'm going to restore from that
[20:19:04] Analytics, Product-Analytics: Hive Runtime Error - Query on event.MobileWikiAppDailyStats failing with errors - https://phabricator.wikimedia.org/T277348 (nshahquinn-wmf) >>! In T277348#6959373, @SNowick_WMF wrote: > will try the spark-sql query next time I need to run these queries. @SNowick_WMF for w...
[20:27:50] !log restore superset_production from backup superset_production_1617306805.sql
[20:27:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:28:33] razzi: I happened to see your comment about the db dump, and now Icinga is alerting on -production
[20:28:47] about the superset DB and replication
[20:29:08] duplicate column name
[20:29:22] mutante: ok, it should be fixed now, let me look at that channel
[20:29:57] maybe you can click "reschedule next service check" and see it go away
[20:30:50] I clicked the "acknowledge this alert with a short-lived silence"
[20:31:25] where do you see that?
[20:32:32] alerts.wikimedia.org
[20:33:13] hmm, ok. that doesn't seem to work
[20:35:12] I don't know how we are supposed to use it, but it does nothing for the Icinga alert. still CRIT and not acked there.. dunno
[20:35:51] I use icinga.wikimedia.org and will click there to ACK it
[20:36:22] mutante: ok thanks for acking
[20:37:06] yep, np. I just happened to be on it for other reasons
[20:37:15] and there is that one analytics host that is down, by the way
[20:37:39] I wonder if that is reported on alerts at all
[20:37:47] an-worker1080 is down
[20:40:16] Quarry, Patch-For-Review, cloud-services-team (Kanban): Prepare Quarry for multiinstance wiki replicas - https://phabricator.wikimedia.org/T264254 (Bstorm)
[20:40:33] Quarry, cloud-services-team (Kanban): Do some checks of how many Quarry queries will break in a multiinstance environment - https://phabricator.wikimedia.org/T267989 (Bstorm) Open→Declined If we can get the tally numbers on this ticket it would be good, but at this point, I think this is a moot i...
[21:27:05] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 2 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (Mholloway) The fixed patch finally rolled out on Tuesday 3/30 with 1....
[22:34:59] Analytics-Clusters, Analytics-Kanban: Upgrade to Superset 1.0 - https://phabricator.wikimedia.org/T272390 (razzi) Ok, I released superset 1.0! I'll keep this open for now, for reporting any regressions.