[00:06:46] If I want to do something "foreach user" in Puppet, where user is the list of our shell users in the admin module, I guess I have to do all the things the admin module does: loadyaml from data.yaml, use the add_all_users parser function, extra regular users etc. Or is there already a shortcut?
[03:42:34] kormat: sobanski: https://phabricator.wikimedia.org/T282761 this might be urgent. I'm having trouble estimating whether or not this is exponential, so things might go very quickly with regards to parser cache, or they might not. I can work on this tomorrow+Friday, just tell me what you want to do :)
[04:35:04] uh ohes
[06:28:47] Krinkle: still didn't find time to review the incident doc but I have it in my todos :)
[06:29:03] does anybody know what the status of logstash100[7-9] is?
[06:29:09] ES on it seems borked
[06:30:22] (mmm, not sure if that's the right term, I was convinced otherwise; "not responsive" may be more appropriate :D)
[06:31:25] not enough master nodes discovered during pinging (found [[]], but needed [2]), pinging again
[06:32:56] maybe the ES nodes are not in the cluster anymore, I see from previous logs that 1011 was considered master
[06:34:53] ah yes https://gerrit.wikimedia.org/r/c/operations/puppet/+/689977
[06:36:55] also commented in the task
[10:42:24] mutante: we could do something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/690366/3 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/690367/4 but curious about the use case
[11:23:39] FYI: this morning I upgraded cumin to the latest version on all prod hosts, see the email to ops-private for more details.
[16:10:10] jynus: fair point re: transactions. this might be a mw dev vs DBA thing, or maybe it's just me being stubborn. I generally talk about transactions when it's about staging multiple write queries to be committed at once (and, to me, they are thus the kind of thing that becomes very relevant when talking about concurrent requests and replication lag).
[16:10:27] it was mostly off-topic
[16:10:41] I acknowledge they're internally all transactions. but I would also say "no transaction" when it's literally BEGIN/UPDATE/COMMIT :)
[16:10:53] but I saw you saying it twice and wanted to clarify terminology, in case it was relevant for this or future discussions
[16:11:04] e.g. regarding performance
[16:11:23] as transactions can have a lot of impact on performance, locking, etc.
[16:11:53] maybe what you meant is what we call "autocommit mode"
[16:12:17] yeah, it's good to remember that implicit transactions don't mean row locks, gap locks etc. aren't still a thing; they are. though not applicable here, since these are all simple append-only writes and primary-key deletes.
[16:12:27] sure
[16:12:55] "no large transactions"
[16:13:15] in time or bytes
[16:14:56] it is not important, but I wanted to align on terminology to make sure we are in sync
[16:15:04] I wonder if it would help if there was something like an autoincrement in front of this table. I think these are actually not strictly append-only. Afaik the key is deterministic, so we do end up replacing old rows indeed.
[16:15:49] in terms of making the deletes faster by int key, and only doing delete/insert instead of replace. But then it'd be harder to make sure the key stays unique
[16:15:52] I guess that's worse
[16:16:01] I think you will meet lucas later
[16:16:07] you ask lots of good questions
[16:16:45] but aside from giving you very generic responses, it will require testing to give you definitive answers :-)
[16:16:52] and if we do replacements of old rows, that means there could actually be locking interactions between main app server traffic and the purge script. I had not considered that before.
[16:17:24] please note manuel and stevie will be the best people to provide advice or just test a few of the options with you
[16:17:44] I believe we can and do avoid those locks, but it's at least within arm's reach, and not so categorically impossible as I claimed.
[16:18:01] yeah, they've been great (hello!). just thinking out loud here
[16:18:18] I am happy to give you couch-answers but they will be able to test them in the field
[16:18:26] happy to have this conversation
[16:18:29] *s
[16:18:56] it's just that on my side they are mostly theoretical
[16:19:25] let me give you 2 extra options
[16:19:29] BTW
[16:19:34] just sittin' on the couch at the bay with uncle jaime, watchin' the tide roll away.
[16:19:41] replicating purges on a different channel
[16:20:10] multi-channel (domains, I think they are called) replication could solve the issue of fast purges, if they run on different "lanes"
[16:20:35] the other is partitioning, to be able to drop lots of rows quickly (but it has some limitations and disadvantages)
[16:20:47] so on purge you just do "drop partition", which is in theory faster
[16:21:11] throwing lots of stuff out there and seeing what sticks is my advice
[16:22:30] first is: https://mariadb.com/kb/en/parallel-replication/#out-of-order-parallel-replication
[16:24:06] second is: https://dev.mysql.com/doc/refman/8.0/en/alter-table-partition-operations.html
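To make the two linked options concrete, here is a hypothetical SQL sketch only: the table name, columns, and partition names are invented for illustration and are not the actual parser cache schema. It shows how range partitioning would turn a bulk purge into a metadata-only DROP PARTITION, and how MariaDB's out-of-order parallel replication could run purges under their own replication domain.

    -- Hypothetical: range-partition the cache table by expiry day so that
    -- purging a day of rows is a cheap DDL operation, not millions of deletes.
    CREATE TABLE parsercache_example (
      keyname VARBINARY(255) NOT NULL,
      value   MEDIUMBLOB,
      exptime DATETIME NOT NULL,
      PRIMARY KEY (keyname, exptime)
    ) PARTITION BY RANGE (TO_DAYS(exptime)) (
      PARTITION p20210512 VALUES LESS THAN (TO_DAYS('2021-05-13')),
      PARTITION p20210513 VALUES LESS THAN (TO_DAYS('2021-05-14')),
      PARTITION pmax      VALUES LESS THAN MAXVALUE
    );

    -- "on purge you just do drop partition":
    ALTER TABLE parsercache_example DROP PARTITION p20210512;

    -- The MariaDB out-of-order option: give the purge job its own
    -- gtid_domain_id so replicas can apply it in parallel with the main
    -- write stream instead of serializing behind it.
    SET SESSION gtid_domain_id = 2;
    DELETE FROM parsercache_example WHERE exptime < NOW() LIMIT 1000;

One of the partitioning limitations alluded to above: every unique key, including the primary key, must contain the partitioning column, which constrains the schema design.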
[18:16:06] jbond42: wow, you already have WIP patch sets? amazing! the use case was to ensure a public_html exists in each home if on a host with the peopleweb role, just to skip that one step in the docs that says "if it doesn't exist, create it with mkdir". not worth too much effort, but really appreciated as well
[19:22:07] mutante: sounds like adding /etc/skel/public_html on people may help. not sure what the admin module uses, but I think if managehome => true then /etc/skel is used (i.e. useradd -m is used)
[19:22:26] that may be the better route to explore, I can take a look tomorrow
[19:23:44] jbond42: oh yea, /etc/skel is a good point. also it's only a subgroup of the users in admin, the ones with actual shell access and not absented.
[19:24:59] thanks, I'll make a patch later today
[19:25:04] like I said, not sure if the admin class exposes managehome, but we probably can add it if not
[19:25:18] ack
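Both routes discussed above, and the "foreach user" question from the top of the log, can be sketched in Puppet. This is a hypothetical illustration, not the actual admin module code: the data.yaml path, hash layout, and key names are assumptions (the real module uses the add_all_users parser function).

    # Option 1 (hypothetical): seed /etc/skel so useradd -m (managehome
    # => true) copies public_html into every newly created home.
    file { '/etc/skel/public_html':
      ensure => directory,
      mode   => '0755',
    }

    # Option 2 (hypothetical): iterate over the existing shell users,
    # skipping absented ones, and manage the directory in each home.
    $data = loadyaml('/etc/puppet/modules/admin/data/data.yaml')

    $data['users'].each |String $user, Hash $config| {
      if $config['ensure'] == 'present' {
        file { "/home/${user}/public_html":
          ensure => directory,
          owner  => $user,
          mode   => '0755',
        }
      }
    }

Option 1 only covers users created after the change, which is why the loop (or a one-off migration) would still be needed for existing homes.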
[23:03:00] anyone have any ideas on how to troubleshoot the MD RAID status of `wdqs2007`?
[23:03:17] https://www.irccloud.com/pastebin/gYRDh8Mc/
[23:03:38] I imagine I want to open a hw troubleshooting ticket with dcops, but wondering if there are any really simple things I should do first to investigate
[23:11:31] ryankemper: maybe sudo /sbin/mdadm --detail
[23:12:45] [wdqs2007:~] $ sudo /sbin/mdadm --detail /dev/md/0
[23:13:47] that tells us that /dev/sdh2 is the faulty one
[23:13:54] I would add that output / info to the dcops ticket
[23:14:18] 7/8 are ok, sdh2 is not
[23:14:28] mutante: thanks! that makes sense
[23:14:54] yep, yw
[23:19:35] ryankemper: so after that, maybe the next question is "but which physical device is /dev/sdh"; 'sudo lshw -class disk' gives us the serial (193023009C1B) and other info for /dev/sdh
[23:19:50] if you add that as well it should be easy to locate for dcops, I think
[23:19:55] great idea
[23:20:47] there are 0, 1, 2 and 3, and it's the "last" one, disk:3
[23:24:58] Cool, added all of that info to https://phabricator.wikimedia.org/T281437#7086866
[23:26:16] looks good to me :) the part I am never sure about is just when it comes back, who is going to rebuild it
[23:27:04] hopefully it is mostly self-fixing once the hardware is replaced
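On the rebuild question at the end: a rough sketch of the usual md replacement workflow, reusing the device names from the wdqs2007 discussion above. The commands are standard mdadm/sfdisk usage, but the partition layout and the choice of /dev/sda as the healthy donor member are assumptions.

    # Confirm the array state and which member failed
    sudo mdadm --detail /dev/md/0
    cat /proc/mdstat

    # Mark the member as failed (if not already) and remove it from the array
    sudo mdadm --manage /dev/md/0 --fail /dev/sdh2
    sudo mdadm --manage /dev/md/0 --remove /dev/sdh2

    # After dcops swaps the physical disk, copy the partition table from a
    # healthy member to the replacement (assumes /dev/sda is healthy)
    sudo sfdisk --dump /dev/sda | sudo sfdisk /dev/sdh

    # Re-add the partition; md starts resyncing it automatically
    sudo mdadm --manage /dev/md/0 --add /dev/sdh2

    # Watch the rebuild progress
    watch cat /proc/mdstat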