[00:56:11] RECOVERY - Check unit status of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:50:03] Analytics-Radar, Dumps-Generation, Okapi, Platform Engineering: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (RBrounley_WMF) Yes, thanks @ArielGlenn
[01:50:32] Analytics-Radar, Dumps-Generation, Okapi, Platform Engineering: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (RBrounley_WMF) Open→Resolved
[06:51:42] Analytics-Radar, SRE, ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (ayounsi)
[07:03:17] Analytics: Upgrade Druid to 0.20.1 (latest upstream) - https://phabricator.wikimedia.org/T278056 (elukey) Changelogs: https://github.com/apache/druid/releases/tag/druid-0.20.0 https://github.com/apache/druid/releases/tag/druid-0.20.1
[07:03:59] good morning
[07:50:58] Good morning
[07:51:04] elukey: Hi :)
[07:51:14] I'm gonna read some more on Alluxio this morning
[07:51:27] elukey: and thank you for the task about druid-0.20 :S
[07:51:39] joal: bonjour!
[07:52:00] at least we know what it was, those cache misses now make more sense!
[07:52:14] indeed :)
[07:52:18] upgrading to 0.20.1 shouldn't be that hard, but of course it is a plus maybe for next Q
[07:52:28] yup
[07:52:38] (even if I'd do it now :P)
[07:52:58] thinking about a broker cache in memcached/redis could be a nice idea
[07:53:27] even one node
[07:53:38] sure elukey - I wonder about overall gain in comparison to hardware cost
[07:54:37] ah yes that needs to be taken into consideration
[07:55:37] I am wondering how the whole caching setup could affect druid-based dashboard loading
[07:55:50] in theory it should be a good use case
[07:55:55] yup
[07:56:20] On the other hand, we're moving away from druid for many cases, preferring Presto
[07:57:18] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade to Superset 1.0 - https://phabricator.wikimedia.org/T272390 (elukey) Adding some notes from IRC: Superset is kerberized so the move to kubernetes is a little trickier, since we would need to figure out how that works (passing the keytab,...
[07:58:06] joal: sure sure
[08:00:52] Analytics-Radar, DBA, Patch-For-Review: mariadb on dbstore hosts, and specifically dbstore1004, possible memory leaking - https://phabricator.wikimedia.org/T270112 (elukey) Restarted all instances on dbstore1004 today :(
[08:08:22] Thanks for the changes on capacity-scheduler elukey :)
[08:10:34] joal: thanks for your patience! Ok to deploy it to the test cluster?
[08:10:48] Yes elukey! let's test :)
[08:17:59] all right let's start with
[08:18:00] java.io.IOException: mapping contains invalid or non-leaf queue production.analytics
[08:18:03] ahahahha
[08:18:10] :/
[08:18:22] ok - not thorough enough of a review :)
[08:18:50] (CR) Bharatkhatri: "any body tell me the reason of merge conflict" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/656541 (https://phabricator.wikimedia.org/T263973) (owner: Bharatkhatri)
[08:18:58] I am going to live hack on the test node
[08:19:33] I don't understand how production.analytics is non-leaf
[08:19:35] weird
[08:21:53] joal: so it wants only the name of the leaf
[08:21:56] like "analytics"
[08:22:04] elukey: in user mapping
[08:22:13] exactly yes
[08:22:14] capacity-queue-mapping sorry
[08:22:33] ok makes sense - I was about to say, maybe we should have used the full path (root.production.analytics)
[08:22:39] but it's actually the opposite :)
[08:22:44] so this has a non-trivial corollary, namely that leaf queues have unique names
[08:23:03] I tried to use the full path at first, didn't work
[08:23:23] elukey: indeed (see https://issues.apache.org/jira/browse/YARN-6325)
[08:24:50] (CR) Muehlenhoff: [C: +1] "LGTM" (1 comment) [analytics/udplog] - https://gerrit.wikimedia.org/r/673596 (https://phabricator.wikimedia.org/T276623) (owner: Majavah)
[08:37:54] joal: all up and running!
[08:38:02] \o/
[08:38:08] ssh -L 8088:an-test-master1001.eqiad.wmnet:8088 an-test-master1001.eqiad.wmnet
[08:38:08] * joal is gonna have a look
[08:38:32] * joal doesn't even have to make the ssh line - thank you elukey :)
[08:39:35] elukey: so cool! Andrew's job got reassigned correctly :)
[08:40:01] elukey: I think our oozie jobs might not however - let's check
[08:43:19] elukey: actually camus failed
[08:44:11] joal: I think it is because of the 'essential' queue
[08:44:22] elukey: I think so too - checking
[08:44:28] I need to change it, lemme try a quick manual run
[08:44:56] elukey: we don't even have logs for the job
[08:45:51] joal: we do, on an-test-coord
[08:46:02] I mean the systemd timers one
[08:46:07] Ah of course
[08:46:20] elukey: I was checking logs on yarn
[08:50:44] (Abandoned) WMDE-leszek: Review access change [analytics/wmde/scripts] (refs/meta/config) - https://gerrit.wikimedia.org/r/672824 (owner: WMDE-leszek)
[08:51:40] elukey: new config works in production.ingest queue
[08:52:23] joal: goood
[08:52:50] I am currently applying the setting via puppet for all test camus timers
[08:56:19] aaand first refine job
[08:56:20] org.apache.hadoop.security.AccessControlException: User analytics does not have permission to submit application_1616402022758_0004 to queue production
[08:56:23] :P
[08:56:39] all as expected :)
[08:57:37] changing puppet
[08:58:23] joal: ingest or analytics?
[08:58:39] refine should be production.analytics elukey key please :)
[08:58:41] I suppose analytics
[08:58:44] ack perfect :)
[09:08:05] refine now works!
[09:11:25] need to change druid load too of course
[09:16:04] joal: so druid load seems working, both analytics and druid users in the analytics queue
[09:28:52] !log move the yarn scheduler in hadoop test to capacity
[09:28:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:29:09] (CR) Phuedx: [C: -2] "Per Ottomata's comment above. I'll update this patch and the one that relies on it." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668743 (https://phabricator.wikimedia.org/T275766) (owner: Phuedx)
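
For context on the "non-leaf queue" error worked through above: a minimal sketch of the capacity-scheduler mapping property involved, written in the same 'key' => 'value' style quoted later in the log. The property names are the upstream Hadoop ones; the specific users, groups and queue layout are illustrative assumptions, not the exact production config.

    # Rejected: mappings must reference a LEAF queue by its short name, not a
    # path, so "production.analytics" (or "root.production.analytics") fails
    # with "mapping contains invalid or non-leaf queue" (see YARN-6325).
    # 'yarn.scheduler.capacity.queue-mappings' => 'u:analytics:production.analytics',

    # Accepted: leaf name only ("u:<user>:<queue>" / "g:<group>:<queue>"),
    # which is also why leaf queue names have to be unique across the tree.
    'yarn.scheduler.capacity.queue-mappings' => 'u:analytics:analytics,u:druid:analytics,g:analytics-privatedata-users:default',
    'yarn.scheduler.capacity.queue-mappings-override.enable' => 'false',
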
[11:04:12] (PS10) Phuedx: universalLanguageSelector: Add new properties [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668743 (https://phabricator.wikimedia.org/T275766)
[11:09:02] (PS2) Phuedx: universalLanguageSelector: Add timeToChangeLanguage property [schemas/event/secondary] - https://gerrit.wikimedia.org/r/672740 (https://phabricator.wikimedia.org/T275794)
[11:56:09] * elukey lunch
[12:44:51] (CR) Ottomata: [C: +1] "+1 oh but one more thought. You could use the major version bump to remove token from the list of required fields. At least this way you" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668743 (https://phabricator.wikimedia.org/T275766) (owner: Phuedx)
[12:48:37] mornin! o/ mforns https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/673075 ?
[13:34:42] o/
[13:35:07] elukey: heya - ok if I restart oozie webrequest job in test cluster?
[13:35:26] joal: sure! I didn't check, is it stuck?
[13:35:35] elukey: IIRC you even had a script to do so?
[13:35:59] elukey: it is stuck yes, with errors: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User analytics does not have permission to submit application_1616402022758_0010 to queue production
[13:36:11] joal: I have one yes, if you give me the start time I'll kick it off
[13:36:24] ah of course it runs in "production"
[13:37:14] elukey: the interesting part is that the job doesn't fail, it goes into suspended mode (oozie-launcher fails to start)
[13:37:25] ouch, this means that we'll need to roll restart all jobs if we keep this config /o\
[13:37:44] elukey: expected start time would be: 2021-03-22 08:00
[13:37:49] elukey: indeed :(
[13:38:02] elukey: it was kinda known, and it confirms
[13:38:05] joal: what if we rename 'analytics' to 'production' ?
[13:38:06] hm
[13:38:34] elukey: why not,
[13:38:34] so we won't need to change the spark/druid timers too
[13:38:45] I mean restarting all oozie jobs is horrible
[13:38:50] elukey: giving it the leaf-queue is enough?
[13:39:03] joal: I hope so yes
[13:39:08] elukey: restarting all oozie jobs is indeed not nice, but feasible :)
[13:39:33] ah nono ok of course, root.production.production
[13:39:35] uff
[13:39:47] this is not nice in all directions
[13:39:56] right :(
[13:40:21] elukey: I'm gonna kill the current webrequest in test - ok ?
[13:40:50] joal: yes yes let's think about a more conservative solution
[13:41:54] ok elukey, killed
[13:42:27] elukey: I can do the modifs and restart if you wish
[13:43:32] joal: restarted
[13:43:53] it should work, but restarting all oozie jobs..
[13:44:16] actually elukey, 2 coords running - weird
[13:45:04] up elukey - 2 coords started - shall I kill one?
[13:45:12] yep yep probably my bad
[13:46:20] elukey: confirmed webrequest starts flowing again
[13:46:58] ack!
[13:51:10] interesting elukey!
[13:51:16] I can't submit spark jobs
[13:51:24] User joal does not have permission to submit application_1616402022758_0090 to queue default
[13:52:35] joal: I am pretty sure because you are a bad person :D
[13:52:49] I know :)
[13:52:58] elukey: this is why I ask you ;)
[13:53:16] but you should be in privatedata-users mmm
[13:53:57] joal: have you specified the queue?
[13:54:06] or did it work by itself via mappings?
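
A quick illustration of the two submission paths being compared here (explicit queue vs. relying on the scheduler mappings), sketched with spark-submit; the jar, class and queue names are placeholders, not the actual jobs being run.

    # Explicit queue: the target must be a leaf queue's short name
    # ("default", "analytics", ...), the same short-name rule as the
    # scheduler mappings - a path like "users.default" is not a valid target.
    spark-submit --master yarn --queue default --class org.example.Job job.jar

    # No --queue (or spark.yarn.queue): YARN places the application through
    # the capacity-scheduler user/group mappings instead.
    spark-submit --master yarn --class org.example.Job job.jar
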
[13:54:10] both specifying and not
[13:54:17] queue specified: default
[13:54:28] I tried with others: users.default (doesn't exist)
[13:55:10] joal: I suspect that the syntax to allow groups in the acl stuff is different
[13:55:22] mornin yall
[13:55:27] 'yarn.scheduler.capacity.root.users.default.acl_submit_applications' => 'analytics-privatedata-users',
[13:55:30] good morning milimetric
[13:55:35] this probably means "user analytics-privatedata-users"
[13:55:40] milimetric: o/
[13:55:48] shameless plug for https://phabricator.wikimedia.org/T265765#6930945 (the visidata thing I was messing with)
[13:55:56] I'd love to get some eyes on it and spiff it up
[14:03:53] ottomata: o/
[14:03:57] do you have a minute?
[14:03:58] hello!
[14:04:03] ya!
[14:04:21] I am trying to drop some files on thorium that are rsynced from stat1006
[14:04:33] so I have already moved them on stat1006 to /srv/backup just in case
[14:04:40] hey all!
[14:04:46] ottomata: looking at the change
[14:04:56] on thorium though, I keep seeing things like 632G .hardsync.datasets.99pIBqzpm8R7
[14:05:01] joal: we have an interview today, wanna sync before?
[14:05:06] in /srv, that holds the files there
[14:05:17] sure mforns
[14:05:19] I don't recall how the script works though
[14:05:29] i have to open it to recall too, let's see
[14:06:46] mforns: now?
[14:06:56] joal: ok!
[14:06:58] bc?
[14:07:08] OMW
[14:09:53] huh, interesting elukey, those should be moved into the destination datasets dir
[14:10:09] all the files in there are hardlinks to the source dirs
[14:10:14] the ones named for each stat box
[14:10:34] once the script makes hardlink copies of the files in the source dirs into that tmpdir
[14:10:52] it should move that tmp dir into place at dest dir
[14:10:59] temp_dest=$(mktemp -d$mktemp_dry_run $base_temp_dir/.hardsync.$(basename $dest_dir).XXXXXXXXXXXX)
[14:11:04] later
[14:11:04] cmd mv -f "$temp_dest" "$dest_dir"
[14:11:28] also, those trash ones
[14:11:30] should be removed
[14:11:30] cmd rm -rf "$temp_dest_trash"
[14:11:54] elukey: also, afaics, those are all old?
[14:12:03] oh
[14:12:06] no there are new ones sorry
[14:12:20] but, they aren't every day?
[14:13:53] elukey: hardsync-published is still a cron
[14:13:55] but...
[14:14:01] it is commented out (twice) in root's crontab?
[14:14:24] ottomata: I have commented it (with puppet disabled)
[14:14:28] to drop files :)
[14:14:28] oh ok
[14:14:54] elukey: the script output is just being dumped to /dev/null
[14:14:59] and, it runs with set -e
[14:15:00] so maybe
[14:15:06] it is failing sometimes in between steps?
[14:15:16] and that is causing those temp files to remain
[14:15:19] in any case...they are all hardlinks
[14:15:26] so they should be safe to remove
[14:16:17] ottomata: okok I just wanted an extra pair of eyes before taking action
[14:16:21] will do thanks :)
[14:16:24] k cool
[14:23:13] (CR) Mholloway: [C: -1] Image recommendations table for android (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668244 (owner: Sharvaniharan)
[14:35:53] elukey: syntax for ACLs is "user1,user2,... group1,group2,.." - We need a space before the first group (even if there is no user)
[14:36:09] joal: I was wondering the same, but I've read
[14:36:30] (can't find it)
[14:36:41] elukey: found it here: https://partners-intl.aliyun.com/help/doc-detail/62958.htm
[14:36:44] anyway, will try with a manual hack adding space blabla
[14:36:45] (CR) Milimetric: [C: -2] "This is not how we make changes on wikistats. If you would like to edit the translations, you can do so through translatewiki, some docum" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/656541 (https://phabricator.wikimedia.org/T263973) (owner: Bharatkhatri)
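
The one-character fix being discussed: YARN queue ACLs are parsed as "<comma-separated users><space><comma-separated groups>", so a group-only ACL needs a leading space, and a lone space means "nobody". A sketch in the same puppet-hash style as the snippet quoted at 13:55; the exact keys in hiera may differ.

    # Before: parsed as a USER called analytics-privatedata-users (no group part).
    # 'yarn.scheduler.capacity.root.users.default.acl_submit_applications' => 'analytics-privatedata-users',

    # After: empty user list, then the group - this is the "manual hack adding
    # space" being tested here.
    'yarn.scheduler.capacity.root.users.default.acl_submit_applications' => ' analytics-privatedata-users',

    # Disallow-all (mentioned just below): a single space, i.e. no users and no groups.
    'yarn.scheduler.capacity.root.users.default.acl_administer_queue' => ' ',
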
[14:36:48] joal: ah okok!
[14:37:00] (Abandoned) Milimetric: Remove typo error [analytics/wikistats2] - https://gerrit.wikimedia.org/r/656541 (https://phabricator.wikimedia.org/T263973) (owner: Bharatkhatri)
[14:37:39] elukey: disallow-all is space-only, as in "no-user no-group"
[14:37:55] joal: a clear syntax yes
[14:38:01] how can people get it wrong
[14:38:06] Mwahahaha
[14:40:24] joal: can you re-test?
[14:40:34] sure elukey
[14:42:00] elukey: success (error from another bit, but success for yarn)
[14:43:36] joal: filing a change for puppet
[14:43:48] is the error related to the scheduler? I mean, should I wait?
[14:44:18] nope elukey, all good
[14:44:47] elukey: testing resource allocation: all good
[14:45:17] elukey: the fact that our users.default queue is named default means that most job launches will succeed without problems
[14:46:39] elukey: I'm gonna try to use the production queue as my user, just to check
[14:46:45] elukey: and then from analytics
[14:48:16] joal: +1
[14:56:21] elukey: confirmed it all works as expected
[14:56:27] \o/
[14:56:56] elukey: things I'd like to triple check: killing jobs (actually, being refused to), and moving jobs between queues
[14:57:04] leaving for kids now, will do it later
[14:59:30] Analytics, Analytics-EventLogging, Better Use Of Data, Event-Platform, and 4 others: KaiOS / Inuka Event Platform client - https://phabricator.wikimedia.org/T273219 (SBisson)
[15:01:33] Analytics, Analytics-EventLogging, Better Use Of Data, Event-Platform, and 4 others: KaiOS / Inuka Event Platform client - https://phabricator.wikimedia.org/T273219 (SBisson) a: SBisson https://github.com/wikimedia/wikipedia-kaios/pull/358
[15:07:20] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 3 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (Mholloway) @nray Do you plan to review updated patch (updated only to...
[15:20:27] Analytics, Analytics-Kanban, Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (Milimetric)
[15:20:35] Analytics, Product-Analytics: Default table creation settings results in warnings when querying - https://phabricator.wikimedia.org/T277822 (Milimetric) (rolled up into bigtop migration cleanup task (not messing with subtasks))
[15:23:18] Analytics, Product-Analytics, Structured-Data-Backlog: Create a Commons equivalent of the wikidata_entity table in the Data Lake - https://phabricator.wikimedia.org/T258834 (Milimetric) p: Triage→Medium
[15:23:48] razzi: o/
[15:24:05] I tried to connect to clouddb1021 from an-launcher and it works
[15:27:05] so there are two accounts/sqoop scripts
[15:27:07] elukey: what credentials did you use?
[15:27:14] /srv/deployment/analytics/refinery/bin/sqoop-mediawiki-tables -> uses clouddb1021
[15:27:35] sorry
[15:27:39] /usr/local/bin/refinery-sqoop-mediawiki
[15:27:49] and we also have, to keep things not confusing at all
[15:28:00] /usr/local/bin/refinery-sqoop-mediawiki-production
[15:28:09] that hits the dbstore100x nodes
[15:28:23] the former uses the credentials to test, the latter a research/password combination
[15:28:29] that works only on dbstores
[15:31:26] (CR) Mholloway: [C: -1] Image recommendations table for android (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668244 (owner: Sharvaniharan)
[15:40:23] Analytics-Clusters: Migrate eventlog1002 to buster - https://phabricator.wikimedia.org/T278137 (Ottomata)
[15:40:41] Analytics, Product-Analytics (Kanban): Hive table neilpquinn.toledo_pageviews missing almost all data - https://phabricator.wikimedia.org/T277781 (nshahquinn-wmf) @Ottomata and @JAllemandou, thank you very much for investigating this! I realized that on 2020-12-10, I was working on T261953/T267940. As p...
[15:52:59] !log rebalance kafka partitions for webrequest_text partition 2
[15:53:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:20:05] razzi: how many partitions left? Are you able to do two at a time now?
[16:21:10] elukey: 21 partitions to go; each partition for webrequest_text is about twice as big as webrequest_upload which we used to test 2 at a time, but I could try 2 at a time for webrequest_text anyways
[16:21:58] razzi: I'd be +1 to test, let's get ottomata's approval as well, so you'll be able to cut down the time (hopefully)
[16:22:15] test == try one time with 2 text partitions at the same time
[17:30:47] ottomata: if you have time, I'd need some extra eyes for thorium
[17:31:06] so I see a big dir under /srv
[17:31:07] 620G .hardsync.datasets.d6TU2Sb3m5Dw
[17:31:50] now inside it there is the big dir that I want to delete, so if I cd in there and rm -rf it, it returns instantly
[17:32:05] and if I re-run du, I find another .hardsync.etc.. hardlink
[17:32:32] so I am very puzzled about what's happening :D
[17:32:45] I didn't try to rm .hardsync.etc.. directly
[17:33:34] razzi / elukey: just got out of meeting, will retry sqoop shortly and let you know
[17:34:59] milimetric: I'd like to watch over your shoulder, I'm still confused on how to test sqoop
[17:35:37] razzi: omw cave
[17:36:03] elukey: razzi +1 with 2 text partitions
[17:36:37] elukey: not sure i understand
[17:36:46] about hardsync
[17:43:30] ottomata: so what I did is
[17:43:31] cd /srv
[17:43:40] du -sch .[!.]* * |sort -h
[17:43:59] and the biggest dir is a .hardsync.dataset.etc..
[17:44:14] I am wondering what to do with it
[17:45:17] PROBLEM - AQS root url on aqs1011 is CRITICAL: connect to address 10.64.16.201 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[17:45:43] new node :
[17:46:32] Analytics, Data-Services, Machine-Learning-Team, ORES, and 2 others: Generate dump of scored-revisions from 2018-2020 for English Wikipedia - https://phabricator.wikimedia.org/T277609 (calbon) This looks super interesting, when the data is out there I'd love to have it posted for a potential inte...
[17:48:21] ACKNOWLEDGEMENT - AQS root url on aqs1011 is CRITICAL: connect to address 10.64.16.201 and port 7232: Connection refused Hnowlan Host not in use yet. https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
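
For reference while following the thorium cleanup: a condensed sketch of the hardsync-published flow ottomata outlined earlier (around 14:10), showing why a leftover /srv/.hardsync.* dir is only a tree of hardlinks and safe to delete. The mktemp/mv/rm lines are adapted from the fragments quoted above; the loop, the cp -al stand-in and the trash handling are assumptions, not the real script.

    set -e                       # any failure aborts the run mid-flow
    temp_dest=$(mktemp -d "$base_temp_dir/.hardsync.$(basename "$dest_dir").XXXXXXXXXXXX")

    # Hardlink-copy each per-stat-box source dir into the temp dir
    # (cp -al here as a stand-in; the real script may use rsync or similar).
    for src in "${source_dirs[@]}"; do
        cp -al "$src/." "$temp_dest/"
    done

    mv -f "$temp_dest" "$dest_dir"     # move the freshly linked tree into place
    rm -rf "$temp_dest_trash"          # then drop the previous tree's links

    # With output sent to /dev/null and set -e in effect, a failure between the
    # mktemp and the final rm leaves .hardsync.* behind; every file inside is a
    # hardlink, so the data itself still lives in the source dirs.
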
[17:50:34] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 3 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (nray) @Mholloway Sorry for the delay on that, I will for sure review...
[17:54:02] (PS4) Milimetric: Update mysql resolver to work with cloud replicas [analytics/refinery] - https://gerrit.wikimedia.org/r/666209 (https://phabricator.wikimedia.org/T274690)
[18:01:58] !log drop /srv/.hardsync*trash* on thorium - old hardlinks that should have been trashed
[18:02:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:03:41] ottomata: ok now I get what's happening
[18:04:17] we have 100+ .hardsync* dirs under /srv, all containing a reference to the data dir that I want to delete
[18:06:44] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Data-Infrastructure, and 3 others: prefUpdate schema contains multiple identical events for the same preference update - https://phabricator.wikimedia.org/T218835 (Mholloway) No worries, @nray! Thanks for the update.
[18:07:19] !log run rm -rfv .hardsync.*/archive/public-datasets/* on thorium:/srv to clean up files to drop (didn't work)
[18:07:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:07:38] elukey: everything in there should just be hardlinks
[18:07:38] so
[18:07:54] i'd just delete all the .hardsync dirs
[18:08:00] they shouldn't be there anyway
[18:08:16] ottomata: ok I'll proceed then
[18:12:53] !log drop /srv/.hardsync* to clean up hardlinks not needed
[18:12:54] /dev/mapper/thorium--vg-data 7.2T 266G 6.6T 4% /srv
[18:12:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:12:58] ottomata: success :)
[18:18:16] Analytics, Analytics-Kanban: Check data currently stored on thorium and drop what it is not needed anymore - https://phabricator.wikimedia.org/T265971 (elukey) ` elukey@thorium:/srv$ sudo du -hs * 177G analytics.wikimedia.org 8.0K deployment 4.0K log 16K lost+found 3.5G org.wikimedia.community-analytics...
[18:19:32] ottomata: the next step is to figure out the max space that we'll allow for thorium's published-datasets
[18:19:56] would something like 480G work ? (On raid 1)
[18:20:32] we currently use around 240G
[18:20:33] ish
[18:25:06] * elukey afk!
[18:25:10] have a good rest of the day folks
[18:25:15] elukey: that could be good
[18:25:18] i wonder if it would grow
[18:25:39] yes I wonder the same, in case we'll need to order a beefier node
[18:25:39] but i think we should just see what standard hw willy comes up with and pick something close to that
[18:25:40] laters!
[18:25:45] +1
[19:08:35] Gone for tonight - see you tomorrow team
[20:56:00] (PS3) Ottomata: Add support for finding RefineTarget inputs from Hive [analytics/refinery/source] - https://gerrit.wikimedia.org/r/673604 (https://phabricator.wikimedia.org/T212451)
[21:00:24] mforns: just pushed up a less WIP patch
[21:00:43] sorry it is so much...but i think i removed some extra code and made config loading much simpler
[21:00:51] ok, ottomata, lookin, didn't finish yet because of interview, sorry
[21:00:53] lemme know if discussing would help review
[21:01:23] ottomata: sure! if you want to walk me through the changes :]
[21:01:35] wanna take a look first? or bc now? either is good for me
[21:02:53] ottomata: let's chat for 10 mins now, if OK
[21:03:27] k
[22:24:05] (CR) Jdlrobson: [C: +2] "Tested against current master of WikimediaEvents and then again with patch https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikimedia" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668743 (https://phabricator.wikimedia.org/T275766) (owner: Phuedx)
[22:24:44] (Merged) jenkins-bot: universalLanguageSelector: Add new properties [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668743 (https://phabricator.wikimedia.org/T275766) (owner: Phuedx)
[22:38:15] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:00:43] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers