[00:53:52] PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:23:58] PROBLEM - Check the last execution of monitor_refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:06:28] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: AQS is not OpenAPI 3 compliant - https://phabricator.wikimedia.org/T240995 (10colewhite) By all means. The patch was generated by a tool, and I applied some manual stylistic formatting you may or may not want. Have a look and do with it what you see fit. [04:11:19] 10Analytics, 10Analytics-Kanban: Tune up thresholds of data quality hourly alarms - https://phabricator.wikimedia.org/T251814 (10Nuria) 05Resolved→03Open [04:11:22] 10Analytics, 10Analytics-Kanban: Add hourly resolution to data quality outage/censhorship alarms - https://phabricator.wikimedia.org/T249759 (10Nuria) [04:11:47] (03PS1) 10Nuria: Doubling threshold to reduce false positive alarms [analytics/refinery] - 10https://gerrit.wikimedia.org/r/627640 (https://phabricator.wikimedia.org/T251814) [04:45:10] 10Analytics, 10Domains, 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Blocking all third-party storage access requests - https://phabricator.wikimedia.org/T262996 (10RolandUnger) [04:46:37] 10Analytics, 10Domains, 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: WMF third-party cookies rejected - https://phabricator.wikimedia.org/T262882 (10RolandUnger) [07:14:05] good morning [07:14:17] a vote is being held to move Superset to apache top project! 
[07:20:01] (03PS1) 10Elukey: Update to Superset 0.37.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/627738 (https://phabricator.wikimedia.org/T262162) [07:23:38] and what will your vote be? :-) [07:24:19] I am really happy about how the project evolved from a complete mess to a reliable upstream, so if my vote counted I'd +1! [07:26:05] ah, I assumed all Apache members could vote [07:28:15] so I think this is more a community + PMC vote, so I can probably add my +1 as well, as a user [07:28:23] will do it even if it might not count :) [07:36:18] done! [08:05:10] (03PS2) 10Elukey: Update to Superset 0.37.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/627738 (https://phabricator.wikimedia.org/T262162) [08:11:17] !log superset 0.37.1 deployed to an-tool1005 (staging env) [08:11:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:25:33] superset seems to be working, I am noticing more errors in charts related to presto queries taking longer than 60s to complete [08:42:47] there are now dashboards making looong queries to druid and presto [08:43:57] morning! [08:44:24] guten tag! [08:44:25] elukey: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients#Checking_the_space_used_by_your_files I added a reference to ncdu here. I find it very useful when cleaning out old filesystem trees [08:45:05] ah wow TIL! [08:45:08] thanks a lot [08:45:15] the link in the wiki seems broken [08:50:25] I'll fix it [08:51:00] done [08:52:47] thanks! [09:29:15] elukey: so what do I need to do for the Cloud access thingy? [09:32:38] 10Analytics-Clusters: Upgrade to Superset 0.37.x - https://phabricator.wikimedia.org/T262162 (10elukey) I deployed the new version and got a dashboard broken (500 from superset): ` Sep 16 09:30:14 an-tool1005 superset[4191]: ERROR:superset.app:Exception on /superset/dashboard/141/ [GET] Sep 16 09:30:14 an-tool1... 
[09:32:58] klausman: so we are talking about https://horizon.wikimedia.org/auth/login/?next=/ right [09:33:01] ? [09:33:31] Yep [09:33:58] so all the (?) mention wikitech's credentials, including the 2fa [09:34:28] The checklist entry mentions that I should have myself added to a project as an admin [09:34:36] "Get added as a project admin to least one cloud VPS project (your onboarding buddy can propose a suitable one). Then log into https://horizon.wikimedia.org/, create an instance and log into it." [09:34:57] My suspicion is that the credentials only work if I am admin of a project [09:34:58] yes I can add you to the analytics one, no problem [09:35:05] mmmm [09:35:15] adding you to analytics [09:37:23] (it is taking ages) [09:37:29] ok seems done [09:37:33] klausman: can you retry? [09:39:15] Still tells me "invalid credentials". Let me make 1000% sure I am using the wikitech u/p [09:44:57] So it turns out that Horizon *requires* 2FA for login, even if it is not enabled on the WT side. I've added 2FA there, and now I can log into Horizon [09:47:33] gooood [09:48:29] I've added a request to include some language to that effect in the checklist template. [09:48:45] klausman: one qs about the rocm dkms drivers - are we planning to make a separate package for specific kernels etc.. or do you think that the upstream one is enough? [09:52:29] I think it's enough for now. We won't be (re)installing the package all the time, or on many machines, so the compile overhead is small enough to not really warrant the added up-front work and maintenance, I think. [09:52:51] If we had like 50+ machines or updated every week, it would be a different matter. [09:53:11] ack then let's add it to the task so others can comment etc.. 
[09:53:45] Roger, on it [09:54:09] we'll have to install those also on the 6 new hadoop workers soon, SRE is working on adding 4.19 kernel support on stretch via puppet [09:54:21] (we can't really move to buster yet) [09:57:50] that already exists, you can simply include the profile::base::linux419 [10:00:53] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features - https://phabricator.wikimedia.org/T260442 (10klausman) Looking at the install procedure for the rocm upstream drivers, we considered turning the DKMS package (compiling the... [10:01:43] moritzm: ah nice I was seeing patches but didn't know it was already there, good [10:02:27] klausman: second thing is to schedule the upgrade of stat1008, I think we are good to go [10:08:38] Sure, I'll send out a mail after right now and do the update after lunch [10:12:10] very nice [10:12:31] Sent. Now to hunt down some food. [10:13:04] 10Analytics-Clusters: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 (10elukey) @Nuria @Ottomata I think that this could be a good second task for Razzi, since it needs some review of Oozie and how it currently works. Thoughts? [10:35:55] * elukey afk! lunch [11:13:00] (03PS1) 10GoranSMilovanovic: minor 20200815 [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/627806 [11:13:16] (03CR) 10GoranSMilovanovic: [V: 03+2 C: 03+2] minor 20200815 [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/627806 (owner: 10GoranSMilovanovic) [11:28:56] !log starting to upgrade to rock-dkms driver on stat1008 [11:29:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:54:22] elukey:how the hell do you reset a machine using drac? I just spent half an hour using `help` and Googling, and... 
nothing [11:57:53] serveraction powercycle [11:57:57] christ [12:09:19] And we're back [12:11:34] !log stat1008 updated to use rock/rocm DKMS driver and back in operation [12:11:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:22:44] have you used cook books yet? https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks we have one (sre.hosts.reboot-single) which is useful here; it schedules downtime, notifies the -operations IRC channel and triggers the reboot [12:29:41] Ah, neat. I was aware of cookbooks, but not that one [12:31:01] The nasty aspect here was that migrating from one AMDGPU driver to the other hoses the system to the point it won't reboot :-/ [12:31:19] Some PCI-E shenanigans during depmod(!) [12:31:41] what could go wrong :-) [12:38:27] klausman: there is also https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_Documentation [12:38:54] so from cumin something like [12:39:10] ssh root@stat1008.mgmt.eqiad.wmnet (management pw in pwstore) [12:39:37] then you are in the drac, and you can 'racadm serveraction powercycle' to reboot, or 'console com2' to attach to the serial [12:40:33] (more manual) [12:41:14] ah ok reading it better, you just needed the powercycle action okok [12:41:22] nevermind, ETOOVERBOSESORRY [12:41:23] :D [12:42:36] (03CR) 10Joal: [C: 03+1] "LGTM" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/627640 (https://phabricator.wikimedia.org/T251814) (owner: 10Nuria) [12:42:56] elukey: it's alright (and appreciated :)) [13:08:22] elukey: I'll close T260442, we should make a new task for update-to-3.7 considerations [13:08:23] T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features - https://phabricator.wikimedia.org/T260442 [13:08:48] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features - https://phabricator.wikimedia.org/T260442 (10klausman) 05Open→03Resolved 
[13:09:21] klausman: usually what we do is move the task (in kanban) to "Done" and assign points to it [13:09:29] Oh, oops. [13:09:56] (we use fib numbers, and we have a high level map to connect points with time spent on the task) [13:10:00] (not sure where) [13:10:01] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features - https://phabricator.wikimedia.org/T260442 (10klausman) 05Resolved→03Open [13:10:30] 3 is a trivial change in docs, 5 is a quick one- or two-hour fix, this task is probably an 8 [13:10:41] (you can set it via Edit in phab) [13:11:09] Should I only fill the Final SP field or both Final and Estimated? [13:12:10] Final is fine [13:12:29] (Nuria then does a final pass and closes them) [13:13:35] Alright, done [13:13:38] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features - https://phabricator.wikimedia.org/T260442 (10klausman) [13:24:38] hello team :] [13:26:43] (03CR) 10Mforns: [V: 03+2 C: 03+2] Doubling threshold to reduce false positive alarms [analytics/refinery] - 10https://gerrit.wikimedia.org/r/627640 (https://phabricator.wikimedia.org/T251814) (owner: 10Nuria) [13:34:11] I just found https://github.com/PrefectHQ/prefect [13:34:15] (from a friend) [13:35:59] heyall [13:37:57] elukey: as alternative to airflow? :D [13:38:02] hey milimetric :] [13:38:36] mforns: yeah, it looks nice, but not sure how feasible it is of course, I just looked at the gh page :) [13:39:15] loooks interestinngg [13:41:52] dynamic DAGs! 
niiiice [13:42:30] maybe we should repeat the airflow POC with prefect [13:50:48] PROBLEM - Hue CherryPy python server on an-tool1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue runcherrypyserver https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue/Administration [13:51:20] this is me --^ [13:51:33] I love that in https://docs.prefect.io/ under "prefect" there is "don't panic" [13:53:04] PROBLEM - Hue Kerberos keytab renewer on an-tool1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue kt_renewer https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue/Administration [14:00:21] elukey: it seems thought that some parts of the software are not strict open-source. Prefect core is, but the ui and the server have their own license, that does not allow using the software for SAAS. [14:00:29] *though [14:01:17] ah snap, that is a hard blocker then [14:01:23] didn't see it [14:02:01] is it? [14:02:51] it must be like what we are going through with the confluent license and kafka connect [14:03:11] in theory we are not a SAAS but the license is not open source [14:03:14] (in the true sense) [14:03:18] aha [14:03:49] hm [14:10:54] * elukey coffee [14:12:06] their UI is vue [14:12:47] we *need* to get over this licensing model. It's the business model of open source software companies, we're sinking on the titanic and refusing to get saved by a helicopter because we don't believe in flying [14:13:51] :] [14:14:14] we also kind of don't need the UI, though it's nice to have [14:14:33] not sure, there are alternatives with good license models, like apache projects offer [14:14:41] mforns: did you find what their workflow engine is? 
Like Airflow uses celery but I can't see anywhere what Prefect uses other than it'll run on k8 [14:14:51] I am not sure that we need to give up just to make our life easier [14:14:59] there are alternatives to those tools [14:15:23] my point is, those alternatives are going to get increasingly worse. Because there's no business model behind them [14:15:48] strongly disagree, see superset/kafka/druid/etc.. [14:15:58] exactly :) [14:16:09] I am not following [14:16:25] kafka went to partial anti-SaaS licenses [14:16:27] they have been constantly better over time [14:16:38] most of the ecosystem around druid went closed source [14:16:52] nope kafka is an apache project, some confluent specific tools have the anti-saas license, like kafka connect [14:16:53] and superset depends heavily on AirBnB which is in serious trouble right now [14:17:24] superset has its own company behind it now (Preset) that also offers superset as a service [14:17:41] heh, yeah, and how long before they go anti-SaaS as well? [14:18:12] the beauty of being an apache project is that you cannot really do it once you are in [14:18:24] you can create tools that have your own license [14:18:39] how did Pivot fork and re-license then? [14:18:54] it was not an apache project [14:19:06] oh you mean as in ASF, not apache 2.0 licensed [14:19:09] yes [14:21:22] ok, so I'm not disagreeing on the core software. But increasingly the ecosystems around these tools will adopt an anti-SaaS model. And the option is between re-inventing that wheel ourselves or adopting that license. For me, it comes down to seeing zero problems with an anti-SaaS license, like, that makes perfect sense. If you want an SaaS, you should use the company that's putting tons of money and resources into providing [14:21:22] that. 
[14:26:31] re-inventing the wheel is surely not something viable in the long term, but most of the time it is a matter of just picking up another tool for the job, even if it is not exactly like the one with the non open license [14:27:09] maybe a bigger problem with Prefect: it currently only supports Dask distributed executors: https://docs.prefect.io/api/latest/engine/executors.html [14:29:48] I really don't understand how anti-SaaS is not open. Isn't that mentality supporting the right of companies to be FOSS parasites? [14:45:23] Well, are companies who use OSS to do SaaS with their own unreleased changes exercising a right that is covered by the philosophy of Free Software or not? [14:45:30] I don't have an easy answer [14:46:56] I think the problem is also defining what "anti-SaaS" means. On paper we all agree that informally it makes sense, but I fear that a lot of subtle things happen when thoughts are translated into legal terms [14:47:23] like what a SaaS is, and how a license is breached, etc.. [14:47:42] Yeah. And nobody wants to drag any of this to a court, either [14:47:49] exactly [14:54:22] 10Analytics: Separate RSVD anomaly detection into a systemd timer for better alarming with Icinga - https://phabricator.wikimedia.org/T263030 (10mforns) [14:54:54] That makes sense, these licenses should be crystal clear. Maybe the problem is that we need a neutral universal anti-SaaS license. But the alternative is that Amazon just takes what you wrote, puts it on AWS, and kills you. So I'm very against that... 
[14:55:08] (more against that than against trying to figure out the messy anti-SaaS licenses) [14:56:00] and I do think that if there was a decent universal license here, it would allow companies to act more in good faith, like not keeping some parts of their source closed and so on [14:56:01] It's also a bit odd to distinguish between "AMZN runs my unchanged code and makes money" from "AMZN took my code, added stuff, and now runs it to make money" [14:56:30] Like, I'd get the "NC" part in Creative Commons being integral. [14:56:55] But the GPL does not have that (and neither does the Affero GPL, which was made specifically for SaaS things) [14:57:48] And I figure if you offered those changed or unchanged services for free, people would mind less. So it's difficult to know *what exactly* people find objectionable. [14:58:51] I wonder if there could be like a "non-commercial with exceptions granted on a case-by-case basis" license [14:59:14] so then the license literally says in the text "non-commercial except: Google, Wikimedia Foundation, Apple" [14:59:35] but no AMZN, they can suck it [15:00:11] I'm not that anti-AMZN, I still sadly buy stuff on there [15:05:20] milimetric: how bad you want to do this? https://phabricator.wikimedia.org/T253069 [15:06:01] mforns: oh if you want to do it, I don't want to do it at all [15:06:08] (I just want it to be done) [15:06:21] but if you want I can be super-critical in the code review :) [15:06:42] hehe, no, just checking, because I was looking for a MEP task for this upcoming quarter [15:07:03] but I didn't want to grab sth that you wanted to do [15:09:24] then... ok, I'll grab that one, and please, nit-pick at will in CR [15:10:45] mforns: cool, looking forward to it [15:10:56] :] [15:11:02] actually mforns I'm having quite a bit of trouble figuring out a way to sanitize this data, wanna brainbounce in cave for a bit? 
[15:11:15] sure omw [15:11:20] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10MW-1.35-notes (1.35.0-wmf.27; 2020-04-07), and 2 others: Set up an instance of EventStreams in beta that will allow for consuming any stream - https://phabricator.wikimedia.org/T253069 (10mforns) a:05Ottomata→03mforns Assigning this task to me afte... [15:24:51] heya elukey razzi klausman i moved analytics ops sync to after standup to not conflict with the PA sync [15:25:36] Well, my day is going to be super-late anyway, so it makes no difference :-/ [15:26:55] ack! [15:27:49] !log update the TLS backend certificate for Analytics UIs (unified one) to include hue-next.w.o as SAN [15:27:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:29:23] klausman: ya and ops sync is optional too, we can be very flexible on that [15:29:27] feel free to skip :) [15:29:59] Well, I have a late meeting 'til 20:00, so it makes little difference. [15:30:13] "ITS Orientation: Security" [15:31:44] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:02:49] (dear batcave: we'll be right there) [16:02:55] ack [16:06:46] razzi: pingggggoooooo [16:25:52] (03CR) 10Elukey: [C: 03+1] Update oozie jobs replaceAll function quotes [analytics/refinery] - 10https://gerrit.wikimedia.org/r/626930 (owner: 10Joal) [16:26:34] thanks elukey --^ [16:26:38] joal: thank you! [16:26:41] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/626930 (owner: 10Joal) [16:31:55] milimetric: tardis (+link?)? 
[16:32:14] there joal https://meet.google.com/kti-iybt-ekv?pli=1 [16:32:21] (bit.ly/tardis I think) [16:37:09] 10Analytics-Radar, 10Datasets-General-or-Unknown, 10Product-Analytics, 10Structured-Data-Backlog: Set up generation of JSON dumps for Wikimedia Commons - https://phabricator.wikimedia.org/T259067 (10Ramsey-WMF) @Cparle could be your apprentice 😃 >>! In T259067#6461361, @ArielGlenn wrote: > I'll put it in... [16:57:40] 10Analytics-Clusters, 10Analytics-Radar, 10User-Elukey: Monitoring GPU Usage on stat Machines - https://phabricator.wikimedia.org/T251938 (10Ottomata) a:03klausman [16:59:46] 10Analytics-Clusters: [Spike] Explore goblin as an alternative to camus - https://phabricator.wikimedia.org/T252560 (10Ottomata) [16:59:48] 10Analytics, 10Analytics-Kanban, 10Event-Platform: Evaluate possible replacements for Camus: Gobblin, Marmaray, Kafka Connect HDFS, etc. - https://phabricator.wikimedia.org/T238400 (10Ottomata) [17:24:51] going afk! [17:28:07] 10Analytics-Radar, 10Better Use Of Data, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10CDanis) hey @jlinehan -- do you have any concerns about me re-routing the DNS of `intake-logging.w... [17:28:34] I've gotta run out to the dentist with the kids (forgot!) so I'll be afk for a bit [17:36:32] mforns: train etherpad pleasE? [17:36:53] joal: someone already added what I wanted to add [17:36:56] soooo [17:37:00] Meh? [17:37:07] hehehe [17:37:11] what do you need? :] [17:38:05] mforns: I can't see any data-quality line in the "next train" section :( [17:38:21] oh, lookin [17:38:46] oooh, I was looking at the previous deploy [17:38:53] it's the same, will copy [17:39:25] need to restart the 2 jobs? [17:41:34] joal, done, yes both [17:41:43] wait... [17:41:43] ack mforns - thanks a lot :) [17:41:45] sure [17:41:46] joal: no, you're right [17:41:52] only hourly! 
[17:42:00] modifying [17:42:46] ok, now is good! [17:42:50] \o/ [17:42:53] 10Analytics-Radar, 10Better Use Of Data, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10Krinkle) For EventLogging, we specifically moved away from separate domains to using `/beacon` so... [17:42:53] :] [17:43:07] 10Analytics-Radar, 10Better Use Of Data, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10jlinehan) @CDanis this should be fine, the errors are fire-and-forget so this shouldn't cause any... [17:44:00] 10Analytics-Radar, 10Better Use Of Data, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10CDanis) Great, thank you! That was my thinking as well, but I wanted to confirm. [17:44:12] ottomata: I confirm manual check of max(event data raw timestamp) matches the _REFINE value [17:44:31] Ok - deploying refinery now - no need to deploy refinery-source, light train [17:44:49] great thank you joal [17:45:41] 10Analytics-Radar, 10Better Use Of Data, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10jlinehan) >>! In T226986#6467370, @Krinkle wrote: > For EventLogging, we specifically moved away f... 
[17:46:35] !log Deploy refinery using scap [17:46:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:54:02] 10Analytics-Radar, 10Better Use Of Data, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10Krinkle) DNS is relatively quick and well-cached on-device, local network, and in middleware netwo... [17:59:17] !log Deploy refinery onto HDFS [17:59:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:04:55] 10Analytics-Radar, 10Better Use Of Data, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10jlinehan) >>! In T226986#6467398, @Krinkle wrote: > DNS is relatively quick and well-cached on-dev... [18:06:24] 10Analytics, 10Analytics-EventLogging, 10Wikimedia-production-error: OperationError: The operation failed for an operation-specific reason in generateRandomSessionId - https://phabricator.wikimedia.org/T263041 (10matmarex) [18:09:13] PROBLEM - Hue Kerberos keytab renewer on an-tool1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue kt_renewer https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue/Administration [18:10:31] PROBLEM - Hue CherryPy python server on an-tool1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue runcherrypyserver https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue/Administration [18:14:44] 10Analytics-Radar, 10Better Use Of Data, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10Ottomata) Huh, I'm pretty sure I discussed this with bblack and/or traffix team when we were first... 
[18:16:46] 10Analytics-Radar, 10Better Use Of Data, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10jlinehan) >>! In T226986#6467482, @Ottomata wrote: > Huh, I'm pretty sure I discussed this with bb... [18:17:09] 10Analytics-Radar, 10Better Use Of Data, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10Ottomata) Also relevant: {T262996} ? [18:17:44] 10Analytics-Radar, 10Better Use Of Data, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, and 7 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10Ottomata) Right, but I asked originally which we should do, host at same wiki domain at /beacon, o... [18:44:26] !log Kill restart mediawiki-history-reduced job after deploy [18:44:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:46:21] 10Analytics, 10Event-Platform, 10Performance-Team, 10Product-Infrastructure-Team-Backlog: Research and consider network connections made due to Event Platform - https://phabricator.wikimedia.org/T263049 (10Krinkle) [18:57:49] !log Kill-restart webrequest after deploy [18:57:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:00:12] 10Analytics, 10Event-Platform, 10Performance-Team, 10Product-Infrastructure-Data: Research and consider network connections made due to Event Platform - https://phabricator.wikimedia.org/T263049 (10jlinehan) [19:00:55] !log Kill-restart data-quality-hourly bundle after deploy [19:00:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:01:00] (I'm back) [19:09:22] joal: https://gerrit.wikimedia.org/g/mediawiki/extensions/EventBus/+/4fd2ddbfedc1e984e11b697378db7b08bd208135/includes/EventBusHooks.php#93 [19:09:26] * 
milimetric jumps for joy [19:09:41] basically, it's super easy to add [19:09:44] \o/ hurray :) [19:10:32] * joal is upset after cluster user jobs, as trying to be nice in restarting jobs actually causes infinite wait [19:10:34] and it seems to me that these hooks generally happen *after* an action, and they get a log entry. So if it's the case here, I don't see why it wouldn't be the case everywhere else [19:10:52] milimetric: if it's a positive pattern, I take it :) [19:12:00] yep, filing task now to add a log entry to each page and (future) user events [19:12:52] !log Manually kill webrequest-hour oozie job that started before the restart could happen (waiting for previous hour to be finished) [19:12:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:22:12] milimetric: yo yt? [19:22:22] razzi and I are looking at https://phabricator.wikimedia.org/T259307 [19:22:22] but [19:22:27] yes [19:22:31] i don't see a wikistats node in the analytics project [19:22:34] did we delete it? [19:22:56] yes, we did [19:23:11] that was there just to test that symlink from v2 to v1 thing that we were bikeshedding about forever [19:23:16] ok great [19:24:19] 10Analytics, 10VPS-Projects, 10Puppet: Puppet failing on wikistats.analytics.eqiad.wmflabs due to statistics::user - https://phabricator.wikimedia.org/T259307 (10razzi) 05Open→03Resolved This node was deleted. 
[19:24:22] 10Analytics, 10Analytics-EventLogging, 10JavaScript, 10Wikimedia-production-error: OperationError: The operation failed for an operation-specific reason in generateRandomSessionId - https://phabricator.wikimedia.org/T263041 (10Umherirrender) [19:39:33] 10Analytics, 10Analytics-Kanban, 10Platform Engineering: Add log entry details to page and user events in EventBus - https://phabricator.wikimedia.org/T263055 (10Milimetric) [19:46:07] 10Analytics, 10Analytics-Kanban, 10Platform Engineering: Add log entry details to page and user events in EventBus - https://phabricator.wikimedia.org/T263055 (10Pchelolo) Could you expand a bit in what MySQL data are you trying to correlate the events to? It feels conceptually wrong to add ‘log_id’ to an ev... [19:47:49] 10Analytics, 10Analytics-Kanban, 10Platform Engineering: Add log entry details to page and user events in EventBus - https://phabricator.wikimedia.org/T263055 (10Pchelolo) Btw, We can probably add a log event, similar to a recentchange event.