[00:10:22] PROBLEM - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:12:27] Analytics, CheckUser, Patch-For-Review, Platform Team Workboards (Clinic Duty Team), Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (kaldari) @kchapman - This was being worked on by a volunteer a year ago, but never made it throug...
[07:11:58] RECOVERY - Check the last execution of drop-el-unsanitized-events on an-launcher1002 is OK: OK: Status of the systemd unit drop-el-unsanitized-events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:23:23] Good morning
[07:23:36] bonjour
[07:29:25] Analytics, Analytics-Kanban: Investigate oozie banner monthly job timeouts - https://phabricator.wikimedia.org/T264358 (JAllemandou) Ah! I get it ! sorry for being slow :)
[07:33:14] (CR) Joal: [C: +1] "Let's test that and make it happen :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/640146 (https://phabricator.wikimedia.org/T251777) (owner: Fdans)
[07:35:45] Analytics, Analytics-Kanban, Patch-For-Review: Deprecate the 'researchers' posix group - https://phabricator.wikimedia.org/T268801 (elukey) >>! In T268801#6667391, @DStrine wrote: > I'm not sure what this is but I'm pretty sure I don't use this. thanks for the ping. Thanks @DStrine! Since this group...
[10:03:24] Is there still a rule that eventlogging schema fields can never be removed? I thought that had to do with the mysql importer, which is no longer running?
[10:12:21] awight: hi!
We have some code that creates "safe" alters in hive, for example adding a field is not an issue, but I am not sure if removing it is as well
[10:13:09] from my deep ignorance in that code I'd say no, but Andrew is probably the best poc for this
[10:13:13] is it urgent?
[10:13:29] elukey: Thanks, we're okay with leaving the old field, and not urgent. Just a clean-up.
[10:13:55] elukey: But if you don't mind, we could experiment with doing the field remove, to see if anything melts down...
[10:14:01] It's a new schema with no external consumers.
[10:14:51] awight: ehm we get alarms if something breaks at refine time, so I'd ask you to wait if possible just to avoid alerts :D I promise that we'll establish a convention on wikitech if not present already
[10:15:20] elukey: We can try this in January if that's better--or never :-)
[10:15:31] It's fine to just send a "false" into this deprecated field forever...
[10:20:00] awight: Next week would be also fine as well, I am saying not on friday if possible :D
[10:29:35] elukey: ah hehe thank you for the gentle reminder!
[11:28:59] Hi mforns - Could you please let me know when you're on?
[11:39:17] * elukey lunch!
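The "safe alter" rule of thumb from the exchange above (adding a field is fine, removing one is doubtful) could be sketched roughly like this. It is only an illustration, not the refinery code: the function name `classify_schema_change`, the flat `{field_name: type}` dictionaries and the example field names are all assumptions made for the sketch, and removals are treated as unsafe simply because nobody in the discussion confirmed that they work.

```python
from typing import Dict, List, NamedTuple


class SchemaChange(NamedTuple):
    added: List[str]
    removed: List[str]
    retyped: List[str]

    @property
    def is_safe(self) -> bool:
        # In this sketch only purely additive changes count as safe.
        return not self.removed and not self.retyped


def classify_schema_change(old: Dict[str, str], new: Dict[str, str]) -> SchemaChange:
    """Compare two flat {field_name: type} mappings (hypothetical format)."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    retyped = sorted(f for f in set(old) & set(new) if old[f] != new[f])
    return SchemaChange(added, removed, retyped)


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    old = {"session_id": "string", "is_enabled": "boolean"}
    additive = dict(old, action="string")
    dropped = {"session_id": "string"}
    print(classify_schema_change(old, additive).is_safe)  # additive change
    print(classify_schema_change(old, dropped).is_safe)   # field removal
```

A check like this would only flag the change; whether a removal actually breaks anything downstream is exactly the open question in the conversation.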
[11:49:26] Actually taking a break now - Will get back in ~2h
[12:08:11] (PS1) Gerrit maintenance bot: Add eo.wikivoyage to pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/645330 (https://phabricator.wikimedia.org/T269426)
[12:09:08] heya joal I'm here
[12:09:35] oh, now read you were taking a break
[12:10:02] leaving also for some errands
[12:11:42] (PS1) Gerrit maintenance bot: Add wa.wikisource to pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/645332 (https://phabricator.wikimedia.org/T269431)
[12:55:47] Analytics, Analytics-Kanban, Patch-For-Review: Deprecate the 'researchers' posix group - https://phabricator.wikimedia.org/T268801 (Jhernandez) I've used it in the past for Hadoop/Hive queries I believe, but it has been some time since I've need it. I'd prefer to be removed, and if/when I need it I'l...
[13:09:59] (PS1) Gerrit maintenance bot: Add mad.wikipedia to pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/645344 (https://phabricator.wikimedia.org/T269437)
[13:20:16] Analytics-Clusters, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (Cmjohnson) I started adding servers to the racks, I anticipate that these should be ready by the end of next week.
[13:38:05] (PS1) Andrew-WMDE: [WIP] Process EventLogging events for CodeMirror [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/645345 (https://phabricator.wikimedia.org/T260138)
[13:41:05] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html
[13:41:18] "Applications are packaged using a system based on Apache BigTop, which is an open-source project associated with the Hadoop ecosystem."
[13:41:42] Analytics-Clusters, Operations, ops-eqiad: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (Cmjohnson) There are a couple of fatal errors on this server. I have pulled a TSR report from the server and sent to Dell. This may be a bad motherboard. A fat...
[13:54:02] Analytics-Clusters, Operations, ops-eqiad: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (elukey) @Cmjohnson is there any chance that Dell could replace the server?
[13:57:39] Analytics-Clusters, Operations, ops-eqiad: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (Cmjohnson) @elukey no, they will only replace parts
[13:57:56] Analytics, Analytics-Kanban, Patch-For-Review: Deprecate the 'researchers' posix group - https://phabricator.wikimedia.org/T268801 (elukey) >>! In T268801#6669131, @Jhernandez wrote: > I've used it in the past for Hadoop/Hive queries I believe, but it has been some time since I've need it. I'd prefer...
[14:10:08] haha great
[14:11:41] :)
[14:20:21] ottomata: earlier on a*wight asked if it was still forbidden to remove a field from an eventlogging schema
[14:28:41] And threatened to just try removing it next week, to empirically check for "magic smoke" release from the mainframe :-)
[14:34:49] ahahahha I didn't want to ping you but it didn't work :)
[14:34:50] mforns: would you have time now?
[14:35:11] joal: if you have a moment, can you ssh to stat1004?
[14:35:17] I sure can
[14:36:04] This is awesome elukey :)
[14:36:32] nice! I need to roll it out elsewhere, it is just a test
[14:36:43] for the moment it does not do kinit -R
[14:36:43] elukey: I don't know how feasible it is, or if you might dislike it, but would we put some color on those lines?
[14:37:14] A slight red if no ticket is found, and slight green if ticke
[14:37:31] Just an idea - The fact that the info appears is already
[14:37:33] great
[14:38:34] Analytics, Product-Analytics, Product-Infrastructure-Data, Wikipedia-iOS-App-Backlog: [Bug] Metrics API missing November and December data - https://phabricator.wikimedia.org/T269360 (sdkim)
[14:38:51] ah okok I can take a look joal, never done it
[14:39:13] awight: it isn't a good idea
[14:39:14] * joal is proud to try to make elukey do some design :-P
[14:39:17] it can be done
[14:39:20] but it can't be automated
[14:39:45] Analytics, Product-Analytics, Product-Infrastructure-Data, Wikipedia-iOS-App-Backlog: [Bug] Metrics API missing November and December data - https://phabricator.wikimedia.org/T269360 (sdkim) >>! In T269360#6668066, @JMinor wrote: > Looks like the monthly data for November is now up and working. T...
[14:40:13] why / what do you want to do?
[14:41:48] joal: are you saying, behind the lines, that I am not a design-friendly person?? :D I am offended :D
[14:42:40] ottomata: Okay no problem. We've deployed a WIP eventlogging schema and after discussion want to drop a boolean field in favor of adding an "action" enum. So this is just low-priority cleanup. What we'll do instead is introduce the new field (as an optional field), and will send "false" in the old required field.
[14:45:56] elukey: You certainly are a friendly person! :D
[15:03:39] Analytics-Clusters, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) >>! In T260445#6669187, @Cmjohnson wrote: > I started adding servers to the racks, I anticipate that these should be ready by the end of nex...
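The red/green suggestion for the Kerberos ticket lines could look roughly like the sketch below. It is not the script deployed on stat1004, which is not shown in this log: the message wording and the names `ticket_banner` and `has_valid_ticket` are made up for illustration, Python is used only to keep the example self-contained, and the ticket check assumes MIT Kerberos' `klist -s`, which stays silent and reports validity through its exit status.

```python
import subprocess
import sys

# Standard ANSI escape sequences for terminal colors.
GREEN = "\033[32m"
RED = "\033[31m"
RESET = "\033[0m"


def ticket_banner(has_ticket: bool, use_color: bool = True) -> str:
    """Build the one-line login message (wording is hypothetical)."""
    if has_ticket:
        text, color = "You have a valid Kerberos ticket.", GREEN
    else:
        text, color = "No valid Kerberos ticket found, please run kinit.", RED
    return f"{color}{text}{RESET}" if use_color else text


def has_valid_ticket() -> bool:
    """Ask klist whether the credentials cache holds a valid ticket."""
    try:
        # `klist -s` prints nothing; exit status 0 means a valid ticket.
        return subprocess.run(["klist", "-s"], check=False).returncode == 0
    except FileNotFoundError:
        # klist is not installed, so treat it as "no ticket".
        return False


if __name__ == "__main__":
    # Only emit color codes when writing to an interactive terminal.
    print(ticket_banner(has_valid_ticket(), use_color=sys.stdout.isatty()))
```

Gating the color on `isatty()` is a design choice of this sketch, so that escape codes do not end up in non-interactive output.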
[15:04:28] joal: back
[15:06:58] sorry for the schedule mismatch, let me know when you can meet
[15:09:31] Analytics-Clusters, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (Cmjohnson) @elukey I was barely able to scrape enough u space to get all of these into racks. I will do my best to balance but most of the free 2U...
[15:27:39] Analytics-Clusters, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) Updating the task after a chat over IRC: ideally the 24 new nodes could be spread 6 on each row, and some asymmetry in the final distributio...
[15:42:16] ottomata: so we have some space issues in eqiad even with the rack moves, we'll probably get batches of workers racked as we get space
[15:42:32] to keep them spread as much as possible between the rows
[15:43:05] ok
[15:49:14] ottomata: also ok if I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/645320/ ?
[15:52:16] +1 elukey !
[15:52:41] hm waiit elukey do you want to put that in /etc ?
[15:52:47] maybe /usr/local/bin/ is better?
[15:53:31] also, would it be useful to exit 1 if no kerberos ticket?
[15:53:32] then you could do
[15:53:40] kerberos_ticket_info.sh || kinit
[15:56:55] ottomata: it will go in /etc/profile.d/etc.. like the rest no?
[15:57:15] it is to display stuff right after login/ssh
[15:58:14] (a sort of extended motd)
[16:06:58] Analytics, Event-Platform, Services (later): Reliable (atomic) MediaWiki event production - https://phabricator.wikimedia.org/T120242 (Ottomata)
[16:07:25] Analytics, Event-Platform, Services (later): Reliable (atomic) MediaWiki event production - https://phabricator.wikimedia.org/T120242 (Ottomata)
[16:07:40] Analytics-Clusters, Patch-For-Review: Reduce manual kinit frequency on stat100x hosts - https://phabricator.wikimedia.org/T268985 (elukey) >>!
In T268985#6657803, @nettrom_WMF wrote: > This would be awesome! Is there a way to do this for Jupyter as well, since `kinit` within Jupyter is distinct from outs...
[16:08:10] Analytics, Event-Platform, Services (later): Reliable (atomic) MediaWiki event production - https://phabricator.wikimedia.org/T120242 (Ottomata)
[16:08:39] Analytics, Event-Platform, Services (later): Reliable (atomic) MediaWiki event production - https://phabricator.wikimedia.org/T120242 (Ottomata)
[16:10:40] Analytics, Event-Platform, Services (later): Reliable (atomic) MediaWiki event production - https://phabricator.wikimedia.org/T120242 (Ottomata) > We'd also have to some how tie the Kafka produce call with the MySQL DB write call into a transaction. To do this I think we'd need some kind of two phase...
[16:12:39] Analytics, Event-Platform, Platform Engineering, Services (later): Reliable (atomic) MediaWiki event production - https://phabricator.wikimedia.org/T120242 (Ottomata)
[16:14:47] Analytics, Event-Platform, Platform Engineering, Services (later): Reliable (atomic) MediaWiki event production - https://phabricator.wikimedia.org/T120242 (Ottomata) @dianamontalion I updated this task with what I hope is more descriptive of the problem and some possible solutions. I really th...
[16:16:01] joal, milimetric et.al. i just updated https://phabricator.wikimedia.org/T12024 with hopefully what is a better description of the problem for Reliable (atomic) MediaWiki event production
[16:16:11] lemme know what you think and if i got anything wrong or could rephrase things
[16:17:35] ottomata: you left off a 2 at the end of that task #
[16:17:44] https://phabricator.wikimedia.org/T120242
[16:23:57] ottomata: I'm so sad that we've been talking about this for 5!!! Years
[16:24:37] the description is great.
It does make it seem like we've been talking about the same exact thing for five years, which is not quite right :)
[16:25:15] I would love Diana/Eric's opinion
[16:31:51] well it hasn't really mattered that much until now
[16:35:40] ottomata: do we need to page SRE if a kafka jumbo broker is down? I just realized it, it seems a little extreme with 9 nodes
[16:35:55] elukey: hm
[16:36:15] i guess not, do we have other pages that would suffice?
[16:36:25] could we page if N brokers are down?
[16:37:26] yeah this could be good, or maybe if a prometheus metric shows some extreme weirdness, like offline partitions etc.. (I am just thinking out loud, not sure if this metric is the correct one, just as example)
[16:38:04] for kafka main it is essential, especially with only 3 nodes
[16:38:49] ottomata: OR, but you may not like it, we add a group in our paging system for me you and Razzi, and let it page only us
[16:38:51] yeah that would be better, if we could just configure it
[16:39:08] 'kafka broker alive alert threshold'
[16:40:19] * elukey bbiab!
[16:49:41] a-team: I'm in the batcave hanging out with myself
[16:49:53] joining :]
[16:49:55] :)
[16:50:55] milimetric: can you pass me the batcave link, having problems with my browser
[16:51:20] https://meet.google.com/rxb-bjxn-nip
[16:51:32] http://bit.ly/a-batcave
[17:31:11] Analytics, CheckUser, Patch-For-Review, Platform Team Workboards (Clinic Duty Team), Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (Reedy) I don't think Kate is the right person these days. If you want Platform, I guess you migh...
[17:49:22] Analytics, Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (elukey) I just checked and Turnilo 1.28.1 seems out!
https://www.npmjs.com/package/turnilo/v/1.28.1
[18:21:50] Analytics-Clusters: Reduce manual kinit frequency on stat100x hosts - https://phabricator.wikimedia.org/T268985 (elukey) Little update: I added a script under /etc/profile.d that informs the user about the need for a kinit or not right after ssh. For example: ` ssh stat1004.eqiad.wmnet [...] Debian GNU/Lin...
[18:22:54] razzi: o/
[18:23:13] hi elukey
[18:23:23] so about the kafka cr, I had a chat with Cole (from SRE) and it is best if we keep in sync the role kafka logging yaml file too
[18:23:27] even if it is probably not needed
[18:23:37] after that I think that we are ready to merge
[18:25:19] elukey: ack, didn't mean to leave that one out either
[18:40:52] perfect looks good now, feel free to merge and run puppet
[18:41:55] I am logging off for the weekend, see you on Wednesday!
[19:16:52] Analytics: Kerberos Password - https://phabricator.wikimedia.org/T269472 (Swagoel)
[19:54:03] (PS1) Mforns: Add netflow to eventlogging sanitization include-list [analytics/refinery] - https://gerrit.wikimedia.org/r/645419 (https://phabricator.wikimedia.org/T231339)
[20:48:21] Analytics-Clusters, Patch-For-Review: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 (razzi) My thought for next steps here is to install superset on an-tool1010, using the existing database at an-coord1001, and testing that caching works as expected.
[20:59:21] Analytics-Clusters, Patch-For-Review: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 (elukey) >>! In T268219#6670249, @razzi wrote: > My thought for next steps here is to install superset on an-tool1010, using the existing database at an-coord1001, and testing that c...
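The "page only if N brokers are down" idea raised earlier in the day for Kafka jumbo (9 nodes) versus Kafka main (3 nodes) boils down to a per-cluster threshold. A minimal sketch, assuming broker liveness is already available as a list of booleans; the function name `should_page` and the threshold values are hypothetical and do not reflect any existing alerting configuration:

```python
from typing import Iterable


def should_page(brokers_up: Iterable[bool], max_down: int) -> bool:
    """Return True when more than `max_down` brokers are reported down."""
    down = sum(1 for up in brokers_up if not up)
    return down > max_down


if __name__ == "__main__":
    # A 9-broker cluster with one broker down, tolerating one failure.
    jumbo = [True] * 8 + [False]
    print(should_page(jumbo, max_down=1))         # False

    # A 3-broker cluster with one broker down, tolerating none.
    main_cluster = [True, True, False]
    print(should_page(main_cluster, max_down=0))  # True
```

Keeping the threshold as a per-cluster parameter matches the point made in the discussion that a single broker outage matters far more on a 3-node cluster than on a 9-node one.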
[22:39:45] Analytics, Product-Analytics, Product-Infrastructure-Data, Wikipedia-iOS-App-Backlog: [Bug] Metrics API missing November and December data - https://phabricator.wikimedia.org/T269360 (SNowick_WMF) a:SNowick_WMF