[05:03:29] Analytics, DBA: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui)
[05:03:35] Analytics, DBA: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) p:Triage→High
[06:03:52] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (elukey) The main problem in doing the yarn patch first is that new applications submitted IIUC will fail since they will not be accepted by Yarn, causing fa...
[06:13:00] Analytics: Failures registered by drop_event on an-launcher1002 - https://phabricator.wikimedia.org/T283126 (elukey)
[07:19:58] Analytics, DBA: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) @razzi @Ottomata you can use db1125 to replace this host. Most likely it needs to be renamed to dbstore1006: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging Let me know w...
[09:02:36] Analytics, DBA: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (LSobanski) Timing wise, this should happen before DC switchover happens (likely the week of June 21st) as we'll have our hands full during that time. This makes things tricky as the three weeks before that date...
[10:00:08] Analytics-Radar, Analytics-Wikimetrics, Diffusion-Repository-Administrators, Projects-Cleanup, Wikimedia-GitHub: Archive analytics-wikimetrics (deprecated by Event Metrics) - https://phabricator.wikimedia.org/T219334 (Aklapper) Stalled→Open
[10:00:11] Analytics, Analytics-Kanban, SRE: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (Aklapper)
[10:30:05] Analytics, Analytics-Kanban, Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (akosiaris) >>!
In T277012#7096742, @Ottomata wrote: > @Joe @akosiaris q: > > I know using docker images in prod outside of k8s is not really done, but...could we? I woul...
[10:57:51] Analytics, serviceops, User-jijiki: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (jijiki) @akosiaris @Ottomata should we resolve this in favour of T282148 ?
[10:58:00] Analytics, serviceops: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (jijiki)
[11:06:33] Analytics, serviceops: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (akosiaris) >>! In T242861#7098031, @jijiki wrote: > @akosiaris @Ottomata should we resolve this in favour of T282148 ? +1
[12:19:06] Analytics, serviceops: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (jijiki) Stalled→Resolved Continued in T282148
[12:44:48] joal: looks like you are busy but let's chat sometime about shared data platform stuff
[12:45:07] Heya ottomata - kids are sleeping, I have some time now if you wish
[12:45:16] ok! bc
[12:45:20] OMW
[13:39:42] joal: oh, also, i wanted to align because we should talk with ML folks about this too
[13:40:02] i'll make some doc edits and schedule that sometime
[13:40:18] for sure ottomata! We'll need to talk more precisely about feature store as well, but the global ideas are shared :)
[13:42:15] Analytics: Failures registered by drop_event on an-launcher1002 - https://phabricator.wikimedia.org/T283126 (Ottomata) I'll paste what I wrote in analytics-alerts in response to this: I reran this manually last night with no issue. It looks like somehow a bad table name is making it into the list of tables...
[13:43:15] Analytics, Analytics-Kanban: Failures registered by drop_event on an-launcher1002 - https://phabricator.wikimedia.org/T283126 (Ottomata) a:Ottomata
[13:45:18] Analytics, DBA: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Ottomata) What's the urgency of 85% full? Could it wait until Q2 maybe?
[13:47:14] Analytics, Analytics-Kanban, Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (Ottomata) > obtaining access to the docker socket equals root on the machine This is a big one, ok sounds good,. > it's not as easy as just running docker run myimage mya...
[13:47:32] Analytics, serviceops: Clarify multi-service instance concepts in helm charts and enable canary releases - https://phabricator.wikimedia.org/T242861 (Ottomata) +1
[13:50:50] RECOVERY - Check unit status of drop_event on an-launcher1002 is OK: OK: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:50:56] Analytics, DBA: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) >>! In T283125#7098281, @Ottomata wrote: > What's the urgency of 85% full? Could it wait until Q2 maybe? If it increases 3% more, that means we'll not be able to alter the `image` table anymore. At...
[13:59:35] a-team, if anyone wants to get into some exciting data model naming discussions
[13:59:36] https://phabricator.wikimedia.org/T281499#7085780
[13:59:58] that is for WM enterprise, but they are cooperating with arch on knowledge store model. nothing defined at all
[14:00:00] but discussions.
[14:00:20] i've got a meeting with them tomorrow to discuss my comment about capitalization in data fields
[14:10:22] joal: any objection to changing the words we use in shared data platform doc from 'Collect, Process, Serve' to 'Source, Transform, Serve'?
[14:10:27] those words make more sense to me
[14:11:22] no problem for me ottomata
[14:15:42] ottomata: source, transform, and serve look great to me semantically, the only (nit-picky) thing I would point out is: ideally, if you have 3 terms that express different aspects of a thing, you want them to be lexically different as well, and in this case source and serve are somewhat similar, no?
[14:19:43] oh ha!
[14:19:43] hm
[14:19:53] we could call it load?
[14:20:26] i like source though because it includes 'event sourcing'
[14:20:34] but serve is so much better than load semantically no?
[14:20:43] I see
[14:26:17] ottomata: but event sourcing refers not only to collection but more to the way we store information no? state-store vs action-store?
[14:28:31] anyway, those are just nit-picky opinions, I'm actually fine with Source Transform Serve :]
[14:32:29] Analytics, Analytics-Kanban: Superset Presto LIMIT >10000 error - https://phabricator.wikimedia.org/T282632 (Milimetric) @SNowick_WMF, Reportupdater is fine, it's what's available right now. We don't want to slow you down waiting for AirFlow
[14:32:53] mforns: event sourcing refers to how you use event data to propagate state changes
[14:33:14] that's not all 'source' means here in the doc, but it includes it
[14:33:33] event sourcing also has a meaning around 'source of truth', i.e. what is the canonical state data
[14:33:40] aha
[14:33:54] i'm trying to capture that too, what is the 'canonical' source of dataset X or dataset Y
[14:34:23] everything else that has dataset X is just a copy but somewhere is the canonical aka source copy
[14:34:30] understand, yea, I like 'Sourcing' too
[14:40:31] ottomata: 'publish' instead of 'serve'?
[14:40:39] mforns, ottomata - I also like the idea of us having multiple words to describe the things - One would be the main one (title), and the others help for refining - for instance Source/Collect/Ingest -- Transform/Process/???
-- Serve/Present/Publish
[14:41:21] joal: +1
[14:44:39] Analytics, DBA: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Ottomata) Ok, so db1125 is available for reimaging now? I'll bring this up in our standup today, and see if we can get to work on it next week or after.
[14:46:38] Analytics, DBA: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui) Yes, it can be done anytime.
[14:54:01] +1
[14:54:20] Analytics, Analytics-Kanban, Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (akosiaris) >>! In T277012#7098288, @Ottomata wrote: >> obtaining access to the docker socket equals root on the machine > > This is a big one, ok sounds good,. > >> it's...
[15:03:43] Analytics, Analytics-Kanban, Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (Ottomata) > probably a simple use case Ah, this is a simple use case too. We'd configure the service with puppet. Surely all the k8s infra we have is better, and we'd use...
[15:04:30] Analytics, Analytics-Kanban, Packaging: Create a debian package for Apache Airflow - https://phabricator.wikimedia.org/T277012 (Ottomata) Airflow would actually be perfect for k8s, it is stateless (all state is in a DB), if only it could work with Kerberos!
[15:14:49] I can't make standup today I'm afraid
[16:03:22] a-team standup!
[16:03:23] a-team standuppppp??
[16:08:59] Analytics, Analytics-Kanban: Article missing from the Clickstream dataset - https://phabricator.wikimedia.org/T282178 (JAllemandou) Open→Resolved Resolving for now - please reopen if needed :)
[16:26:18] razzi or ottomata: do you have some mins today to help me with https://gerrit.wikimedia.org/r/c/operations/puppet/+/692909 ?
[16:27:28] mforns: I can try! Is the issue that the puppet code needs refactoring?
[16:28:00] razzi: jenkins is complaining about the import
[16:28:35] I think I made a conceptual mistake, importing refinery from an unrelated module
[16:29:06] not sure how to approach that, though, I couldn't find any examples of similar things in puppet
[16:29:28] reportupdater is currently independent from refinery
[16:29:53] but I need refinery/bin/hdfs-rsync script to rsync logs to HDFS
[16:29:55] my 2c: I'd add a parameter to the class to get the refinery path, and then pass it from the ru profile (that can import ::profile::analytics::refinery and get the path)
[16:30:14] aaaaaah
[16:30:20] or where the reportupdater class is instantiated
[16:30:29] and can the reportupdater class assume hdfs dfs is there?
[16:31:29] !log restart turnilo for T279380
[16:31:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:31:32] T279380: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380
[16:31:52] mforns: mmm one thing - would it be ok to move the timer to the RU puppet profile?
[16:32:23] elukey: sure
[16:32:30] because it requires require ::profile::analytics::cluster::packages::hadoop etc..
[16:32:37] so you'd be set
[16:32:44] there's no reportupdater profile though
[16:32:44] I mean profile::reportupdater::jobs
[16:32:49] ok ok
[16:33:02] ok, will try, thanks!
[16:33:18] mforns: ah remember that the systemd commands need full path for commands
[16:33:35] so profile::reportupdater::jobs
[16:33:39] err /usr/bin/hdfs
[16:34:06] and it is not bash, so the ; etc.. might not work
[16:34:15] ah...
[16:34:26] there is something that you can use though
[16:34:31] is there a way to ensure presence of a directory in HDFS?
[16:34:42] bigtop::hadoop::directory
[16:34:51] there are some examples in puppet
[16:35:10] oh, but that would be not synchronized with the timer runs, right?
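A minimal Python sketch of elukey's two points above about systemd timer commands: the executable should be given as a full path (`/usr/bin/hdfs` rather than `hdfs`), and shell syntax such as `;` is not interpreted by bash unless a shell is invoked explicitly. The function names, the operator list, and the example HDFS paths are assumptions made for illustration; they are not taken from the puppet patch under discussion, and the check only catches operators that are separated by whitespace.

```python
import shlex
from typing import List

# Tokens a shell would interpret. Assumption: treated here as "needs an
# explicit shell" when they show up in a systemd command line.
SHELL_OPERATORS = {"&&", "||", "|", ">", ">>", "<", ";"}


def check_exec_start(command: str) -> List[str]:
    """Return a list of problems found in a candidate ExecStart command."""
    tokens = shlex.split(command)
    if not tokens:
        return ["command is empty"]
    problems = []
    # The first token is the executable; elukey's note is that it needs a full path.
    if not tokens[0].startswith("/"):
        problems.append(f"executable {tokens[0]!r} is not an absolute path")
    # Only whitespace-separated operators are detected by this simple check.
    for token in tokens[1:]:
        if token in SHELL_OPERATORS:
            problems.append(f"shell operator {token!r} needs an explicit shell")
    return problems


def wrap_in_shell(command: str) -> str:
    """Wrap a multi-step command so that an explicit shell interprets it."""
    return "/bin/sh -c " + shlex.quote(command)
```

For instance, `check_exec_start("/usr/bin/hdfs dfs -mkdir -p /tmp/example-logs")` reports no problems, while `check_exec_start("hdfs dfs -mkdir -p /tmp/example-logs ; hdfs dfs -put x /tmp/example-logs")` flags both the relative executable and the `;`.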
[16:35:25] the directory could be deleted between puppet run and timer run
[16:35:31] Analytics, Analytics-Kanban, SRE, Traffic, Patch-For-Review: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (JAllemandou)
[16:35:32] you can require => Bigtop::Hadoop::Directory[...]
[16:35:38] ok!
[16:35:58] does this still require an ensure presence? or is it explicit?
[16:36:05] I mean implicit
[16:36:56] it is 'present' by default, but if you want to make it dependent on the main ensure of the profile it is ok
[16:37:23] (with absent it does a hdfs rm skip trash, pretty brutal but works :D)
[16:37:32] Analytics, Analytics-Kanban, SRE, Traffic, Patch-For-Review: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (JAllemandou) Open→Resolved The new field is in turnilo with data starting from May 18th 2021. https://w.wi...
[16:38:49] mforns: I have some doubts on the approach though, what about file permissions?
[16:39:49] elukey: these are plain logs without any PII, I thought they could be public for all users?
[16:39:53] an alternative could be to force rsyslog to ship RU logs to logstash
[16:41:14] ah they all have the other perms bits (checked on an-launcher1002), I was only wondering that bit
[16:41:21] elukey: that would be great!
[16:41:39] I know that we do it for syslog
[16:41:42] for all the hosts
[16:41:52] but not sure if it can be done selectively
[16:41:56] aha
[16:43:19] probably not :(
[16:43:47] ok, so rsync from the jobs profile?
[16:44:35] yes yes it is easier
[16:50:28] 👍
[16:55:58] Analytics, DBA: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (razzi) a:razzi Thanks for calling this out @Marostegui and offering db1125. I'll get started on the reimage of db1125.
[16:58:30] Analytics, Analytics-Kanban, DBA: dbstore1004 85% disk space used.
- https://phabricator.wikimedia.org/T283125 (razzi)
[17:00:54] joal: i can't compile refinery?
[17:01:03] cannot find symbol
[17:01:03] [ERROR] symbol: class ColumnFamilyOutputFormat
[17:01:19] this is on stat100o7o
[17:01:39] oh gm
[17:01:40] hm
[17:01:44] oh sorry i have untracked files
[17:01:47] never mind!
[17:12:24] Typing is complicated :) https://counterexamples.org/title.html
[17:13:41] Analytics, Analytics-Kanban, Product-Analytics: Aggregate table not working after superset upgrade - https://phabricator.wikimedia.org/T280784 (razzi) Open→Resolved
[17:14:02] Analytics, Analytics-Kanban: Remove request for font.googleapis.com from analytics.wikimedia.org - https://phabricator.wikimedia.org/T182804 (razzi) Open→Resolved
[17:14:05] Analytics, Product-Analytics, Epic: Revamp analytics.wikimedia.org data portal & landing page - https://phabricator.wikimedia.org/T253393 (razzi)
[17:18:22] (PS1) Ottomata: ProduceCanaryEvents - produce events one at a time for better error handling [analytics/refinery/source] - https://gerrit.wikimedia.org/r/692934 (https://phabricator.wikimedia.org/T270138)
[17:31:05] (PS2) Ottomata: ProduceCanaryEvents - produce events one at a time for better error handling [analytics/refinery/source] - https://gerrit.wikimedia.org/r/692934 (https://phabricator.wikimedia.org/T270138)
[17:31:37] mforns: if you have a moment could you review https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/692934
[17:31:38] ?
[17:31:46] lookin
[17:32:18] (PS3) Ottomata: ProduceCanaryEvents - produce events one at a time for better error handling [analytics/refinery/source] - https://gerrit.wikimedia.org/r/692934 (https://phabricator.wikimedia.org/T270138)
[17:32:40] (PS4) Ottomata: ProduceCanaryEvents - produce events one at a time for better error handling [analytics/refinery/source] - https://gerrit.wikimedia.org/r/692934 (https://phabricator.wikimedia.org/T270138)
[17:34:13] (PS5) Ottomata: ProduceCanaryEvents - produce events one at a time for better error handling [analytics/refinery/source] - https://gerrit.wikimedia.org/r/692934 (https://phabricator.wikimedia.org/T270138)
[17:38:01] (CR) Mforns: "LGTM! Just a couple minor comments, please ignore if they don't make sense." (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/692934 (https://phabricator.wikimedia.org/T270138) (owner: Ottomata)
[17:45:49] Analytics, Analytics-Kanban, Product-Analytics: Request for access to analytics-privatedata-users - https://phabricator.wikimedia.org/T283190 (schoenbaechler)
[17:59:09] (PS6) Ottomata: ProduceCanaryEvents - produce events one at a time for better error handling [analytics/refinery/source] - https://gerrit.wikimedia.org/r/692934 (https://phabricator.wikimedia.org/T270138)
[17:59:11] (CR) Ottomata: ProduceCanaryEvents - produce events one at a time for better error handling (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/692934 (https://phabricator.wikimedia.org/T270138) (owner: Ottomata)
[18:00:15] (CR) Mforns: [C: +1] "LGTM!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/692934 (https://phabricator.wikimedia.org/T270138) (owner: Ottomata)
[18:53:04] ottomata: quick question - is the patch for missing revision event live?
[18:53:28] ottomata: I see that blocked train is wmf.6 - does that mean that .5 is live?
[19:05:50] (CR) Ottomata: [C: +2] ProduceCanaryEvents - produce events one at a time for better error handling [analytics/refinery/source] - https://gerrit.wikimedia.org/r/692934 (https://phabricator.wikimedia.org/T270138) (owner: Ottomata)
[19:07:51] Analytics-Clusters, Discovery-Search (Current work), Patch-For-Review: Missing hourly partition for event.mediawiki_revision_recommandation_create - https://phabricator.wikimedia.org/T283084 (Ottomata) ^ will be deployed next week, that should keep this from happening again.
[19:08:29] joal: it's live! it went out with .5
[19:08:29] yes
[19:08:40] https://versions.toolforge.org/
[19:43:05] Analytics, Analytics-Kanban, Event-Platform: produce_canary_events job should not fail if a schema is missing examples - https://phabricator.wikimedia.org/T270138 (Ottomata)
[20:01:29] Analytics-Clusters, Analytics-Kanban, SRE, Patch-For-Review: Switch kafka/Hadoop away from java::security - https://phabricator.wikimedia.org/T282454 (Ottomata) Had to revert Kafka change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/692661 Can un-revert after {T279342} is done.
[20:08:58] Analytics, Analytics-Kanban, DBA: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: `db1125.eqiad.wmnet` - db1125.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found physi...
[20:17:17] Analytics, User-ArielGlenn: Spike [2019-2020 work] Oozie Replacement.
Airflow Study / Argo Study - https://phabricator.wikimedia.org/T217059 (Ottomata)
[20:17:39] Analytics, Analytics-Kanban, Patch-For-Review: Spike: POC of refine with airflow - https://phabricator.wikimedia.org/T241246 (Ottomata)
[20:18:15] Analytics, Product-Analytics, Epic: Replace Oozie with better workflow scheduler - https://phabricator.wikimedia.org/T271429 (Ottomata)
[20:18:26] Analytics, User-ArielGlenn: Spike [2019-2020 work] Oozie Replacement. Airflow Study / Argo Study - https://phabricator.wikimedia.org/T217059 (Ottomata) duplicate→Resolved
[20:18:54] Analytics, Analytics-Kanban: Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance - https://phabricator.wikimedia.org/T272973 (Ottomata)
[20:18:56] Analytics, Product-Analytics, Epic: Replace Oozie with better workflow scheduler - https://phabricator.wikimedia.org/T271429 (Ottomata)
[20:20:00] mforns: remind me, how does airflow + job dependencies work?
[20:20:11] can airflow scheduler run isolated, and then launch other things?
[20:20:28] for simplicity, let's say all deps are local
[20:20:35] airflow's python env is different than the job's python env
[20:20:45] (or i guess the job isn't even python)
[20:20:49] does that matter?
[20:38:12] ottomata: hm, I never got to look into whether the DAG-interpretation env is identical to the Operator env...
[20:38:34] i'm looking into building a deb and using conda for it rather than virtual env
[20:38:37] My guess is that it should be, since both codes (for DAG and Operator) are in the same file
[20:38:43] that will allow the python env and exec to be self contained too
[20:38:52] in Airflow?
[20:39:12] yes
[20:39:26] so conda is like virtualenv, except that binary execs can be included too
[20:39:29] so totally self isolated
[20:39:41] (mostly, it does use system C libs i think)
[20:39:41] how would conda work within Airflow?
[20:39:53] i would make a conda env for airflow
[20:39:56] and then package it in a deb
[20:40:04] I see
[20:40:07] so it would just be a self contained python env
[20:40:12] ok
[20:40:33] so, my q was, if someone has a dep in their job, maybe let's say numpy
[20:40:40] aha
[20:40:49] and that dep is not available to airflow scheduler
[20:40:54] how does that work?
[20:40:59] ebernhardson: for any tips too ^
[20:42:12] ottomata: I don't know :[
[20:43:14] I imagine if you're using a PythonOperator then the execution env is the same as the scheduler env
[20:43:43] probably the code executes in another place (in our case Celery)
[20:43:56] but not sure what happens
[20:44:33] hm right right
[20:44:38] actually...there is now a DaskOperator
[20:44:45] which means tasks could run on yarn if we set that up!
[20:44:53] sorry
[20:44:56] dask executor
[20:45:18] ottomata: https://stackoverflow.com/questions/49738173/how-to-run-airflow-pythonoperator-in-a-virtual-environment
[20:45:22] ok so, as long as the DAG python file itself does not need dependencies outside of the scheduler env
[20:45:25] it should be ok, right?
[20:46:29] fff, actually no idea, would need to read more about it
[20:47:09] As the example says, we could use a BashOperator and call any venv we want
[20:47:39] ok i thiiiinik this will work then
[20:47:41] ty
[21:09:37] with regards to airflow and python dependencies, our policy is that nothing that does actual work should be implemented in airflow. Airflow's job in our case is to submit jobs to the yarn cluster. Any dependencies go there
[21:09:51] ottomata: ^
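A rough Python sketch of the idea mforns and ebernhardson describe above: the scheduler's environment only builds and launches a command, and the job itself runs under a different interpreter, so the job's dependencies (numpy in ottomata's example) need to exist only in the job's environment. It uses just the standard library so it runs without Airflow installed; the venv path and script name in the usage note are hypothetical. A string like the one `build_job_command` returns is the kind of value that could be passed as `bash_command` to Airflow's `BashOperator`, but that wiring is left out here and is an assumption about how one might use it.

```python
import shlex
import subprocess
import sys
from typing import Sequence


def build_job_command(python_bin: str, script: str, args: Sequence[str] = ()) -> str:
    """Build a shell command that runs `script` with a specific interpreter."""
    # Quote every part so paths or arguments with spaces survive the shell.
    return " ".join(shlex.quote(part) for part in [python_bin, script, *args])


def run_in_other_env(code: str, python_bin: str = sys.executable) -> str:
    """Run `code` in a separate interpreter process and return its stdout.

    The calling process never imports the job's dependencies; only the
    child interpreter at `python_bin` needs to have them installed.
    """
    result = subprocess.run(
        [python_bin, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

For example, `build_job_command("/opt/job-venv/bin/python", "job.py", ["--date", "2021-05-19"])` yields a single command line that points at the job's own interpreter, and `run_in_other_env("import sys; print(sys.prefix)", python_bin="/opt/job-venv/bin/python")` would report the prefix of that environment rather than the caller's.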