[01:06:27] 10Quarry, 10DBA, 10Data-Services: SQL requests to DB replicas became work much slower, both from Quarry and from process on Toolforge - https://phabricator.wikimedia.org/T262757 (10bd808) Plugging the query into https://sql-optimizer.toolforge.org/ gives this explain result: |id| select_type| table| type|... [05:26:31] 10Quarry, 10Data-Services: SQL requests to DB replicas became work much slower, both from Quarry and from process on Toolforge - https://phabricator.wikimedia.org/T262757 (10Marostegui) Keep in mind that Quarry has a hard limit on 30 minutes queries. The labsdbhosts themselves, they have a query killer that ki... [05:30:26] good morning [05:30:29] :) [06:16:54] I might have found some way to workaround the last issue with hue, namely the hive panel not behaving as expected (see https://github.com/cloudera/hue/issues/1264) [06:17:23] it seems due to the super old hive-server2 version that we run, and the thrift mismatch between what hue runs and what the hive server runs [06:17:32] hopefully it'll be better with hive 2.2.3 [06:18:02] anyway, I didn't find more weirdness but I would keep for some time hue.wikimedia.org (current one) and hue-next.wikimedia.org [06:18:16] or something similar if possible [06:18:32] the latter with debug logging always enabled, forcing people to use it [06:18:55] I am pretty sure we'll find more bugs / issues etc.. [06:19:14] when we are ready, we deprecate current hue [06:19:24] Cc: moritzm: --^ :) [06:19:27] does it make sense? 
[06:19:46] slower transition, but probably better for mental sanity [06:35:08] G [06:35:14] Good morning [06:37:17] bonjour [06:56:21] elukey: I did a search in oozie folder for 'replaceAll' after your patch in webrequest and found some other instances that could lead to python-parsing errors I think - will send a patch [06:56:25] sounds good, if hue.w.o is around as a fallback we can also be a little more disruptive in switching hue-next.w.o to CAS and fix the transfer of the uid in httpd [06:56:43] joal: ack thanks! [06:56:57] joal: as FYI I am rolling out kafka ferm rules, initially only on 1009 [06:57:22] ack elukey - I assume mjolnir has a working solution for the scaling issue? [06:57:44] joal: yes Erik gave me the green light, some stuff not yet solved but they shouldn't fall back to es [06:57:54] ack elukey [06:58:16] Thanks for the update - I'll be interested to understand the solution :) [06:58:35] joal: they used multiple instances of mjolnir on the same VM basically [06:59:21] wow - so the bottleneck was on mjolnir not being able to process fast enough, not about hardware limitations [07:00:32] yep I think so [07:00:42] Interesting :) [07:00:57] already found a host that I didn't whitelist, kafkamon1001 [07:01:00] adding to the ferm rules [07:01:48] elukey: My guess is that you'll might find some examples of those in the next minutes :) [07:02:43] yes yes I am tailing syslog, where we log deny actions [07:02:54] but 1009 is not yet very used [07:03:47] * joal imagines elukey's eyes moving as following QuickSilver [07:07:52] (03PS1) 10Joal: Update oozie jobs replaceAll function quotes [analytics/refinery] - 10https://gerrit.wikimedia.org/r/626930 [07:08:27] elukey: I have fun charts for you when you want [07:23:31] 10Analytics-Clusters: Establish what data must be backed up before the HDFS upgrade - https://phabricator.wikimedia.org/T260409 (10JAllemandou) [07:23:40] elukey: --^ [07:26:04] nice! 
will check in a bit, still working on the ferm rules [07:26:13] ack elukey [07:49:51] elukey: also, can you help cleaning non-bz2 files in labstore? I'm working on fixing the rsync, but cleaning up first seems a good idea [07:50:57] joal: sure [07:52:35] joal: is the rsync deleting on the target dirs? [07:52:42] I guess not but worth double-checking [07:53:08] elukey: it is not, for explicit reasons (preventing deleting data in case of mistake) [07:53:18] elukey: and it's actually a good thing! [07:53:27] elukey: I just found a very interesting/weird bug [07:54:18] joal: okok I just wanted to double check [07:59:11] Morning! [07:59:23] good morning :) [07:59:26] Good morning klausman [07:59:29] :) [08:00:02] Not sure yet about the "good" part, but working on it :) [08:00:56] * joal sends positive waves to klausman [08:01:12] * klausman feels vibrated [08:08:00] 10Analytics: pagecounts-ez of month 2020-08 is incomplete - https://phabricator.wikimedia.org/T262141 (10JAllemandou) NOTE: Talking about `pagecounts-ez` folder below, not other pageview/pagecount folders. Things to discuss/fix: * Yearly projectcounts/projectviews * The dumps site references this page: https:... [08:08:06] elukey: --^ [08:09:34] joal: I only now realized that your last name sort of (but not quite) makes you German :) [08:10:28] klausman: Indeed there is a fuzzy link from the name, but not from family history :) [08:11:45] klausman: or to be more precise - the family must have come from Germany at some point, but I don't know when :) [08:13:02] My mum did some ancestry tracing a while back, but couldn't get much further back than around 1848. Turns out my family is from/passed through Alsace at the time [08:14:18] The February Revolution/March Revolution in fr/de apparently destroyed a surprising amount of records, both public and in churches [08:15:42] klausman: I assume it was of interest for some (maybe many!)
[08:16:34] What we never figured out is why there is a whole bunch of Klausmanns in South-Western Germany, and then a tiny island near Krefeld (North-Western Germany) and very little elsewhere. [08:16:47] We have *no* idea what the connection (if any) is to the Krefeld bunch [08:18:06] Fun :) [08:28:14] so I am discovering new clients [08:28:34] the logstash nodes pull from kafka jumbo topics like the EL error one [08:28:46] * joal had guessed that :) [08:29:45] yep but nobody told me :) [08:30:30] :) [08:30:34] elukey: can I help? [08:30:56] nope for the moment I am watching logs, will need to come up with a new list [08:31:01] ack [08:31:21] elukey: let me know when you have a minute, I have interesting results on HDFS-fsimage analysis [08:31:33] joal: better this afternoon if it is ok for you [08:31:39] np [08:32:48] klausman: to give you context, I am working on https://phabricator.wikimedia.org/T204957 - namely adding more strict ferm/iptables rules to kafka-jumbo100* nodes [08:33:16] in syslog we have a log like 'ulogd[830]: [fw-in-drop] etc..' that lists what is dropped, very handy [08:33:17] Huh. I am not allowed to see that task [08:33:56] klausman: ok fixed :) [08:34:27] it shouldn't really be restricted, will fix it in a bit [08:34:28] I am a big fan of ulogd :) [08:35:01] anyway, I have disabled puppet on all kafka-jumbo (9 nodes) and re-enabled it only on 1009, that is newer/less-trafficated [08:35:22] and now I am checking syslog to see what it gets blocke [08:35:23] *blocked [08:35:25] So as to see if the new rules do/don't break the bix? [08:35:30] box* [08:36:06] yes exactly, we don't really know all the clients connecting, I made a list but I left some stuff out apparently [08:36:34] What ports are of interest here?
I see kafka listening on a whole bunch [08:36:53] (via `sudo lsof -Pni|grep kafka.*LISTEN`) [08:36:54] 9092 is plaintext, 9093 is TLS [08:38:26] kafka clients/producers can contact any node to know who is the leader for a partition, namely the one to send traffic to [08:39:56] ack. How do you get a canonical list of "these things need to be able to talk to this instance"? [08:40:49] klausman: there is one now in puppet that I have built by asking around, checking puppet, etc.. [08:41:07] you can find it in hiera, looking for role/common/kafka/jumbo/broker.yaml [08:42:53] Ugh, puppet and its syntax will never please me [08:43:15] it is also ferm syntax with a million () that hurt the eyes [08:43:38] Is its author maybe a CLISP fan? :) [08:43:50] I am pretty sure yes [08:44:38] git gc [08:44:41] oops :) [08:46:55] But yeah, I have found that any complex Netfilter rule set eventually devolves into maintenance-by-squeal: you change something and wait to see if any service or user squeals about broken stuff [08:47:08] that is very sad [08:47:30] but it is the only thing that currently prevents any client in the prod network from dumping any data from jumbo [08:47:52] eventually proper authz/n with kerberos will be deployed (I hope) [08:49:40] What is preventing Camus from using TLS?
age mostly, its codebase is abandoned (we are updating an old fork) [08:50:50] and it still uses the old kafka consumer client, which is deprecated nowadays (also not supporting basically anything) [08:51:32] we thought about upgrading the fork to use newer kafka clients, but then we decided to concentrate our efforts on finding a (supported) alternative [08:51:56] Yeah, makes sense [08:52:43] https://phabricator.wikimedia.org/T238400 [08:54:05] (coffee brb) [08:54:33] klausman: camus runs as a systemd timer on an-launcher1002 [08:54:45] systemctl list-timers to see the full list of things we run from there [08:55:17] (all those timers have icinga alarms on their related .unit failures) [09:08:04] I need to read up on systemd triggers. I've avoided them for the familiar mess that is cron [09:08:12] timers* [09:14:33] I found it really nice compared to crons, especially for commands like systemctl list-timers [09:14:43] the first time that I saw it I was almost crying [09:15:03] (considering the gigazillion crons that we are/were running) [09:15:08] My main annoyance with cron is overruns and how it handles (or rather: doesn't) failure and retries [09:15:42] Especially after using a cluster-level cron at Google which had almost all of those problems fixed :) [09:15:59] :) [09:16:19] systemd should be smart enough to avoid triggering a unit again if it is currently running [09:16:26] it doesn't solve all use cases but it is a start [09:18:01] It's just neat when you can describe a cronjob as: "Run this once a month, in the first week of the month, whenever there's capacity. It is not restartable. If it fails, send a message through Subspace (internal notification system for projects), then try the run again, but never more than three times, and never outside of the first week. Run #1 should use priority 25 (batch), but [09:18:03] you can escalate to p115 for #2 and #3."
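[Editor's note: the Camus-on-a-timer setup discussed above can be sketched as a hypothetical systemd unit pair. Unit names, paths, and the schedule here are assumptions for illustration, not the real an-launcher1002 units. With `Type=oneshot`, systemd will not re-activate the service from the timer while a run is still in progress, which is the no-overlap behavior mentioned in the conversation.]

```ini
# camus-webrequest.timer -- hypothetical unit name
[Unit]
Description=Schedule a Camus webrequest import

[Timer]
# Fire every 30 minutes; Persistent=true catches up a missed run after downtime.
OnCalendar=*:00/30
Persistent=true

[Install]
WantedBy=timers.target

# camus-webrequest.service -- activated by the timer above
[Unit]
Description=Camus webrequest import

[Service]
# oneshot: the timer will not start a second copy while this one is running,
# and a non-zero exit marks the unit failed (which Icinga can then alert on).
Type=oneshot
User=analytics
# Hypothetical jar/properties paths; CamusJob is the fork's real entry point.
ExecStart=/usr/bin/hadoop jar /srv/camus/camus.jar com.linkedin.camus.etl.kafka.CamusJob -P /etc/camus/webrequest.properties
```

[`systemctl list-timers` then shows the next/last activation times for every such unit, which is what makes this nicer to inspect than a crontab.]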
[09:18:35] yeah that is really nice :D [09:20:47] There's also a bunch of flexibility in locality, i.e. "run near my data" [09:29:04] elukey: is there any way i can help with the Netfilter stuff? [09:35:55] klausman: definitely, if you have time [09:36:14] I can give you the summary of what I have done so far [09:36:28] disabled puppet on all jumbos except kafka-jumbo1009, where I ran puppet [09:36:43] (so it runs the new firewall rules) [09:36:59] then I checked in /var/log/syslog for fw-in-drop entries [09:37:18] some ipv6 issues are "known", namely: [09:38:30] - centrallog1001/2001 DNS records don't have AAAA records yet, so they try to connect via IPv6 getting refused. In theory they should fall back to ipv4, but we'll have to be careful. On the centrallog nodes we run "kafkatee", basically a kafka consumer that dumps to file. [09:38:46] (SRE use it to log some webrequest data, like sampled + 503s) [09:39:29] - netflow[3,4,5]001 the hosts didn't have AAAA records, I fixed it with https://gerrit.wikimedia.org/r/c/operations/dns/+/627195 [09:40:38] then we have some "logstashXXXX" nodes, from SRE: they run logstash, that in turn pulls from kafka [09:40:57] I wasn't aware of this use case, and in puppet there is no label to group them [09:41:34] I see. What's involved in making a label? 
(we have some labels in hiera, check for example git grep kafka_brokers_jumbo, hieradata/common.yaml) [09:42:09] in theory we are discouraged from using those more [09:42:30] because they tend to be a maintenance nightmare over time, especially if they grow [09:44:24] I am currently trying to make the horrible profile::kafka::broker::custom_ferm_srange a little more maintainable [09:51:55] namely https://gerrit.wikimedia.org/r/c/operations/puppet/+/627241 [09:52:38] I'll have a look-see [09:52:52] I am running the puppet compiler now, in theory it should be a no-op [09:54:15] of course I cannot use + in puppet for strings [09:54:18] always forget [09:54:19] fixing [09:57:46] I wonder if there is a way to factor out the repeated resolve() calls from that list [10:00:12] there might be, not sure [10:00:27] so pcc shows a no-op, I think that the list is way more readable now [10:01:14] ok to merge? [10:03:07] just to be sure, I ran this on cumin1001 [10:03:08] sudo cumin 'c:profile::kafka::broker and not kafka-jumbo*' 'disable-puppet "elukey"' [10:03:26] since I touched profile::kafka::broker, which is shared among multiple nodes/roles [10:06:06] now I am running [10:06:06] sudo cumin -m async 'kafka-main2001*' 'enable-puppet "elukey"' 'run-puppet-agent' [10:08:26] (then I am doing the rest, slowly, all noops) [10:08:40] (this is usually not needed but my paranoia level is higher on Monday) [10:14:55] ok all deployed :) [10:15:09] Sorry, I completely failed to hit return on this: [10:15:34] "Yeah, LGTM. Mind that I am no expert in puppet nor ferm." [10:16:31] me too!
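[Editor's note: the "[fw-in-drop]" syslog entries being tailed during this rollout can be summarized with a small script like the one below. The line format is a hypothetical reconstruction based on the 'ulogd[830]: [fw-in-drop] etc..' snippet quoted earlier; real ulogd output fields depend on its configuration.]

```python
import re
from collections import Counter

# Hypothetical ulogd syslog line shape, reconstructed from the snippet quoted
# above; adjust the pattern to the actual ulogd logemu key=value output.
DROP_RE = re.compile(
    r"ulogd\[\d+\]: \[fw-in-drop\].*?SRC=(?P<src>\S+).*?DPT=(?P<dpt>\d+)"
)

def summarize_drops(lines):
    """Count dropped connections per (source IP, destination port)."""
    counts = Counter()
    for line in lines:
        m = DROP_RE.search(line)
        if m:
            counts[(m.group("src"), int(m.group("dpt")))] += 1
    return counts

sample = [
    "Sep 14 08:30:00 kafka-jumbo1009 ulogd[830]: [fw-in-drop] IN=eth0 SRC=10.64.0.10 DST=10.64.16.99 PROTO=TCP SPT=51234 DPT=9092",
    "Sep 14 08:30:01 kafka-jumbo1009 ulogd[830]: [fw-in-drop] IN=eth0 SRC=10.64.0.10 DST=10.64.16.99 PROTO=TCP SPT=51235 DPT=9092",
    "Sep 14 08:30:02 kafka-jumbo1009 kernel: unrelated noise",
]
print(summarize_drops(sample))
```

[Run against `/var/log/syslog`, a tally like this would surface unexpected clients (the logstash nodes, kafkamon1001, etc.) faster than eyeballing the tail.]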
:D [10:16:50] ok so the next couple of hosts that I found are webperfXXXX nodes [10:17:32] they run profile::webperf::processors [10:17:39] that uses kafka-jumbo [10:20:26] and I'll also add the logstash nodes one by one [10:20:38] not great but for the moment it seems the only viable option [10:22:43] but there are a lot, sigh [10:25:53] https://gerrit.wikimedia.org/r/c/operations/puppet/+/627247 is for webperf [10:29:24] klausman: I am going afk in a few (lunch + gym), ok if we restart in the afternoon? [10:33:03] Aye [10:33:14] I need to shop for some groceries for lunch anyway :) [10:38:52] all right :) [10:38:55] * elukey lunch! [13:06:16] hi teammm [13:06:43] yoojoo [13:08:47] o/ [13:14:28] Hullo [13:34:57] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/627282 when you have some time :) [13:37:09] nvm, Moritz was faster :) [13:37:12] ah! [13:37:16] 10Analytics-Radar, 10Event-Platform, 10Platform Engineering: Duplicated revision_create events - https://phabricator.wikimedia.org/T262203 (10Ottomata) Is it possible that something is producing revision-create events other than EventBus? Not likely, right? I just looked through EventBus code and I don't s... [13:37:16] he is always faster [13:39:42] 10Analytics: install mwparserfromhell on spark for efficient usage of wikitext-dump in hive - https://phabricator.wikimedia.org/T262044 (10Ottomata) That's great stuff!!!!! [13:58:53] milimetric: o/ when you are around we can deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/626223 [14:00:10] ottomata: I have +2, do I basically just merge? [14:00:33] no there's more to it [14:00:42] gone for kids, back at standup [14:00:49] are there docs? [14:00:51] we merge, then apply the helm chart [14:00:53] i think i have to do that [14:00:58] 10Analytics-Radar, 10Event-Platform, 10Platform Engineering: Duplicated revision_create events - https://phabricator.wikimedia.org/T262203 (10JAllemandou) There are events with different timestamps.
In July 2020 (when the number of dups peaked) I found a revision with up to 16 events, and different timestamp... [14:01:19] k, merging [14:01:31] https://wikitech.wikimedia.org/wiki/Deployments_on_kubernetes#Deploying_with_helmfile [14:01:39] ok [14:02:09] actually [14:02:11] maybe you can do it... [14:02:11] hmmm [14:02:20] milimetric: log into deploy1001.eqiad.wmnet [14:02:27] cd /srv/deployment-charts/helmfile.d/services/staging/eventstreams [14:02:54] oh those docs are not yet for eventstreams...there is a reorg of the repo in progress [14:02:59] hasn't been applied for es yet it looks like [14:03:16] they are getting rid of the datacenter-specific service directories [14:03:21] but anyway, for now we cd into them [14:03:26] milimetric: you want to try? [14:03:30] sure [14:03:36] I'm on deploy1001 [14:03:40] and reading the docs [14:03:40] ok ya, in that dir [14:03:49] source .hfenv; helmfile diff [14:04:10] ah we are using [14:04:11] https://wikitech.wikimedia.org/wiki/Deployments_on_kubernetes#Deploying_with_the_legacy_helmfile_organization [14:04:13] those docs ^ [14:08:59] ok, looks good, so gonna do -i apply now [14:09:10] ya [14:10:01] k sweet, I added a link to the deployment train, with a note to ask you about the rest of the process [14:10:08] ok that's just staging [14:10:10] did that work? [14:12:32] yeah, it worked, but it says version "0.2.3" [14:12:33] what's that? [14:13:22] so do the same in /srv/deployment-charts/helmfile.d/services/eqiad/eventstreams and /srv/deployment-charts/helmfile.d/services/codfw/eventstreams? [14:13:35] yup [14:13:45] that's probably the chart version milimetric [14:13:50] k [14:13:53] not the software image version :) [14:18:08] I have applied ferm rules to kafka jumbo 1009, and I'd like to move on to 1006 (more trafficated node), ok to proceed? [14:18:23] I see that there is some ES deployment ongoing, I can wait just in case [14:18:43] elukey: ES uses kafka-main [14:18:46] so proceed!
[14:19:01] ahhhh right PEBCAK, then I can proceed :) [14:19:02] thanks [14:21:36] 1006 looks good [14:22:12] but I am going to tail syslog and roll out the rest of the node veeeery slowly [14:35:05] milimetric: how goes? [14:35:17] oh it's good, deployed [14:35:24] I thought it logged, but I guess that's in -ops [14:36:02] !log deployed eventstreams with new KafkaSSE version on staging, eqiad, codfw [14:36:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:39:21] oh hu i must not have eventstreams as an IRC keyword! [14:39:24] nice thanks milimetric ! [14:39:32] oh thank you! [14:39:48] it's nice knowing how this part works, I always wondered [14:54:56] added the ferm rules to 1008 too [14:56:11] I don't see any drop registered from ferm [14:56:30] * joal claps for elukey [15:00:55] ping elukey , ottomata [15:01:21] coming :) [15:09:25] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10Ottomata) [15:13:54] 10Analytics: Purge raw webrequest_stats and webrequest_stats_hourly - https://phabricator.wikimedia.org/T262826 (10JAllemandou) [15:14:26] 10Analytics, 10Patch-For-Review: Use MaxMind DB in piwik geo-location - https://phabricator.wikimedia.org/T213741 (10Nuria) Let's make sure to test this is working once deployed, what does our data on piwik say? 
http://piwik.wikimedia.org [15:16:40] 10Analytics, 10Patch-For-Review: Use MaxMind DB in piwik geo-location - https://phabricator.wikimedia.org/T213741 (10Nuria) For reference: https://matomo.org/docs/geo-locate/ [15:18:08] 10Analytics-Clusters: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10klausman) a:03klausman [15:33:18] 10Analytics-Clusters, 10Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10Nuria) [15:33:37] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features - https://phabricator.wikimedia.org/T260442 (10Nuria) [15:37:15] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Use MaxMind DB in piwik geo-location - https://phabricator.wikimedia.org/T213741 (10Nuria) a:03razzi [15:38:21] moved the tasks that klausman and razzi are working on to the kanban board [15:54:01] 10Analytics, 10Product-Analytics: Set up environment for Product Analytics system user - https://phabricator.wikimedia.org/T258970 (10mforns) [15:54:38] 10Analytics, 10Product-Analytics: Set up environment for Product Analytics system user - https://phabricator.wikimedia.org/T258970 (10mpopov) p:05High→03Medium [15:55:02] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Set up environment for Product Analytics system user - https://phabricator.wikimedia.org/T258970 (10Nuria) [15:55:21] 10Analytics, 10Analytics-SWAP, 10Product-Analytics, 10User-Elukey: pip not accessible in new SWAP virtual environments - https://phabricator.wikimedia.org/T247752 (10mforns) [16:03:14] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Patch-For-Review: Refine drops $schema field values - https://phabricator.wikimedia.org/T255818 (10Nuria) We will tackle this problem once we move to spark 3 to make sure we can fix struts and map types [16:05:10] 10Analytics-Radar, 10Technical-blog-posts: Story idea for Blog: The Best 
Dataset on Wikimedia Content and Contributors - https://phabricator.wikimedia.org/T259559 (10srodlund) @Milimetric Just checking in to see how you are progressing on this? [16:10:41] 10Analytics-Clusters: Create a cookbook to automate the bootstrap of new Hadoop workers - https://phabricator.wikimedia.org/T262189 (10mforns) [16:13:20] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Review current usage of HDFS and establish what/if data can be dropped periodically - https://phabricator.wikimedia.org/T261283 (10mforns) a:03JAllemandou [16:15:41] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Stats for newer projects not available - https://phabricator.wikimedia.org/T258033 (10mforns) a:05mforns→03JAllemandou [16:19:34] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Epic: Add data quality alarm for mobile-app data - https://phabricator.wikimedia.org/T257692 (10mforns) [16:19:36] 10Analytics, 10Analytics-Kanban: Create data quality alarm on access-method - https://phabricator.wikimedia.org/T257276 (10mforns) [16:20:41] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10MW-1.35-notes (1.35.0-wmf.27; 2020-04-07), and 2 others: Set up an instance of EventStreams in beta that will allow for consuming any stream - https://phabricator.wikimedia.org/T253069 (10mforns) [16:24:58] 10Analytics, 10Analytics-Kanban: Hadoop Hardware Orders FY2019-2020 - https://phabricator.wikimedia.org/T243521 (10mforns) [16:36:02] 10Analytics-Kanban: Add Presto to Analytics' stack - https://phabricator.wikimedia.org/T243309 (10mforns) 05Open→03Resolved a:03mforns We already have Presto in our stack. In subsequent tasks we might add Alluxio to speed up Presto. 
[16:43:20] 10Analytics-Kanban, 10Analytics-Radar, 10Product-Analytics: Technical contributors metrics definition - https://phabricator.wikimedia.org/T247419 (10mforns) 05Open→03Resolved [16:43:24] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Epic: Tech Tunning Session metrics - https://phabricator.wikimedia.org/T247100 (10mforns) [16:43:27] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Epic: Tech Tunning Session metrics - https://phabricator.wikimedia.org/T247100 (10mforns) 05Open→03Resolved [16:44:12] 10Analytics-Kanban: Data quality Dashboards 2.0 - https://phabricator.wikimedia.org/T242995 (10mforns) 05Open→03Invalid We decided for Superset and Hue [16:46:00] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Better Use Of Data, and 6 others: Modern Event Platform: Schema Guidelines and Conventions - https://phabricator.wikimedia.org/T214093 (10mforns) 05Open→03Resolved [16:46:02] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10mforns) [16:48:13] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10mforns) [16:49:11] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2.0. - https://phabricator.wikimedia.org/T130256 (10mforns) [16:49:36] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Beta - https://phabricator.wikimedia.org/T186120 (10mforns) 05Open→03Resolved a:03mforns [16:55:12] should there be so many 5xx errors we're serving for POSTs to intake-analytics.wm.o? https://logstash.wikimedia.org/goto/eb14f96223481fd50d3409ce9164b575 [16:55:18] ottomata: ^ [17:11:16] * elukey afk! [17:11:23] Bye elukey [17:29:39] cdanis: that data looks strange as it is a json post to webrequest topic? 
(cc ottomata ) [17:29:59] cdanis: webrequest topic shoudl only receive data from varnishkafka directly not from eventgate [17:30:03] *should [17:30:18] nuria: oh, the logstash dashboard in question is one that does come from traffic, showing all varnishkafka-reported 5xx [17:30:36] but I still thought it was curious that we're seeing so many 5xx for intake-analytics [17:30:38] cdanis: i see, i misunderstood [17:36:26] looking at webrequest_sampled_128 it does still seem to be a very small fraction of total volume, so that's something [17:43:54] milimetric: the question about using x-analytics is for https://phabricator.wikimedia.org/T261952 [17:45:30] hey team, I'm not feeling well, thought I could work, but I'm starting to feel tired, will log off, see you tomorrow! [17:45:31] milimetric: thanks for the clarification in slack! looks like we'll be fine since this will not be a big volume of requests [17:47:00] (reading to catch up with that task) [17:47:19] Bye mforns - Take care! [17:48:34] bearloga: I get more of the context now. I think it's kind of a mistake to do this via webrequest, it's putting a needle in the haystack and then using up the cluster to find it again. Why not just emit events? [17:48:51] cdanis: no that is strange [17:49:03] i don't see 5xxes on the eg dashboard [17:49:13] oh is this kaios? [17:49:47] ottomata: the TTFB indicated on these looks to be ~60 seconds so they might be getting mis-routed or something at some layer before the k8s cluster and timing out [17:49:56] right, inuka, hm... we need to find a good eventing bridge for that [17:50:15] huh [17:50:16] milimetric: nope, web, but the inuka team is working on this. they originally wanted to send events but legal has no existing policy for sending events from external sites [17:51:07] so any kind of event-based solution is currently waiting on approval from legal when/if it ever comes [17:52:17] that's unfortunate, ottomata what do you think?
Did you consider external events and/or talk to legal about it before? [17:52:41] I guess the problem is they're crossing over the internets [17:52:53] aye [18:00:08] ? [18:00:18] aren't external events always crossing over the internet? [18:04:31] 10Analytics-Radar, 10Event-Platform, 10Platform Engineering: Duplicated revision_create events - https://phabricator.wikimedia.org/T262203 (10JAllemandou) Ping @Ottomata and @Pchelolo - I found something even more problematic: the duplication can happen with a change of performer! ` presto> SELECT ->... [18:05:54] like, external sources (e.g. wikipedia app, a new york times article, apple's siri service) can *technically* send events to us – the intake service allows for that – but we currently only have policy for ourselves (our apps). wikipedia preview, since it will run on partner websites, is not covered [18:07:50] hmm, is this also relevant to NEL cdanis ^? [18:08:04] ottomata: Legal said NEL was fine, FWIW, so I think not :) [18:08:21] cdanis: is working on getting chrome reportingAPI to send events to us. those are sent by the browser and browser code, not our JS code [18:08:38] bearloga: dunno about legality of anything, that's for legal to decide i guess. [18:09:09] i don't know what the difference between having them send an event and us collecting the same data via webrequest would be though [18:09:15] seems legally the same thing to me [18:09:56] ottomata: that's what I think too! but *snaps suspenders* I'm no big-city lawyer [18:10:53] I think it has to do with which origins rather than which code? but I really don't know :) [18:12:47] bearloga: is legal telling you the partners can't send events? [18:15:22] 10Analytics-Radar, 10Datasets-General-or-Unknown, 10Product-Analytics, 10Structured-Data-Backlog: Set up generation of JSON dumps for Wikimedia Commons - https://phabricator.wikimedia.org/T259067 (10Ramsey-WMF) $someone seems to still be null 😄 Do you have time for this @ArielGlenn ?
ottomata: correct, since in this case it’s not that they’re explicitly disallowed it’s that they’re not explicitly allowed and because of that events can’t be sent [18:17:53] but what is the difference? [18:17:56] explicitly allowed? [18:18:02] the data is being sent either way [18:18:19] via an http webrequest log, or via an explicit data send, (which is also an http webrequest) [18:18:29] ¯\_(ツ)_/¯ [18:18:34] they're not explicitly allowed by legal? [18:18:35] bearloga: (cc milimetric) [18:18:43] I fully support the event request [18:19:10] bearloga: are there any legal docs we can look at? [18:19:36] bearloga: cause we receive events from 3rd parties for years now ( cc ottomata ) like googletranslation [18:19:45] bearloga: so i do not see the difference here [18:21:02] bearloga: let's have a meeting with legal to clarify that events are being sent/received and processed from 3rd parties for some time now [18:21:12] before taking any technical decision [18:21:41] nuria: sounds good! Since Neil is lead on this I’ll ask him to set it up [18:22:38] nuria: he and I are 100% with you on this fwiw and are similarly confused why one is fine but not the other [18:23:13] nuria: o/ [18:23:20] razzi: is going to sync up with luca about the matomo stuff tomorrow [18:23:25] we're looking for other things for him to work on [18:23:33] i could think of some newpyter stuff [18:23:40] related to cleaning up kernels maybe. [18:23:49] but, maybe there are other ideas? [18:24:01] OR actually hmmmm [18:24:03] razzi: [18:24:15] this one is tricky: https://phabricator.wikimedia.org/T255973 [18:24:21] but the prep work to actually do it is not [18:24:26] you'd learn a LOT about how Kafka works [18:24:31] Interesting, yeah [18:24:37] i can expand that task description to explain more [18:25:13] razzi: have you had much/any experience with jupyter notebooks in the past?
I've used them a bunch as a "consumer" [18:25:29] ok [18:25:33] that's good [18:25:37] I've only set up basic jupyter server stuff [18:25:48] ok let's see what nuria says but here are my two ideas for you, and you could choose which you'd prefer [18:26:15] we've got a new jupyterhub-based notebook setup (we refer to it as newpyter) but it is still kinda experimental [18:26:27] one of the things I want to do is simplify the python spark kernels [18:26:38] How about newpyter, since I have some understanding of jupyter, and I'll take some time on my own to read up on Kafka [18:26:47] right now there are a bunch of custom pyspark kernels deployed as part of the old system (which is called SWAP) [18:27:19] gone for tonight team [18:27:34] i want to remove them, add docs to use pyspark directly from python without the custom kernels, and make a decommission plan for the old SWAP stuff [18:27:36] slowly [18:27:36] * razzi waves bye to joal [18:27:40] ok cool [18:27:51] ya the other would be mostly learning all about how kafka works [18:28:00] ok [18:28:15] nuria: does that sound ok? I'll try to make a task with more explicit instructions for newpyter help [18:30:19] ottomata: in teh onboarding doc [18:30:21] *the [18:30:28] ottomata: the next task we had was https://phabricator.wikimedia.org/T259307 [18:30:31] cc razzi [18:31:09] nuria: he can't do that one [18:31:16] he needs perms to remove an ldap user [18:31:32] ottomata: i see [18:31:51] ottomata: and this one is complete? [18:31:52] https://phabricator.wikimedia.org/T252617 [18:32:05] 'complete', no, it isn't really that well defined [18:32:14] although razzi ya a few more patches on that would be good too [18:32:19] but you know what to do there I think ya? [18:32:43] maybe finishing up the analytics profile classes would be good?
[18:32:47] ottomata: well, maybe defining when that one can be called done is a good thing [18:33:16] ok, i think doing it for all the analytics profile classes would be good enough [18:33:19] ottomata: let's close that one before moving on to others [18:33:26] there's a billion classes we use in puppet that could be done [18:33:33] ok [18:33:37] nuria: and after that? [18:34:08] ottomata: matomo? [18:34:18] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Use types in Analytics Puppet classes/profiles/etc.. - https://phabricator.wikimedia.org/T252617 (10Ottomata) [18:34:21] he's working on that now, but needs to sync up with luca tomorrow [18:34:24] ottomata: and we should probably get one more for after matomo, ya? [18:34:26] about uit [18:34:28] yeah [18:34:53] was thinking newpyter migration? [18:39:05] ottomata: I think we can probably use a transitional task to that one, maybe this one? https://phabricator.wikimedia.org/T240439 [18:39:40] oh yaaa [18:39:41] hmmm [18:39:47] that is smaller for sure [18:39:49] mayb. [18:40:09] i have no idea how that works in puppet [18:40:10] but it ust [18:40:11] must [18:40:22] i think mutante (daniel zahn) would know? [18:40:25] hmm [18:41:58] ottomata: if that sounds good let's do that one before newpiter [18:42:13] 10Analytics: Prep for Newpyter promotion and SWAP decomission - https://phabricator.wikimedia.org/T262847 (10Ottomata) [18:42:17] 10Analytics, 10User-Elukey: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (10Nuria) a:03razzi [18:42:37] razzi: (cc ottomata ) background info for sites is here: https://phabricator.wikimedia.org/T227860 [18:42:44] razzi: linked from main ticket [18:43:29] 10Analytics-Radar, 10Performance-Team, 10Research: Citation Usage: Can instrumentation code be removed? 
- https://phabricator.wikimedia.org/T262349 (10Gilles) 05Open→03Resolved [18:44:58] ottomata: Re: types, I could keep going there, but I feel like it'd be more useful to add types as I make actual changes. So I'd recommend closing that or at least narrowing the scope [18:45:52] razzi: i narrowed the scope to just the analytics profile [18:45:54] classes [18:45:58] Alright cool [18:45:59] you could even just throw them all up as one patch [18:47:25] oh look apparently i do know how to use envoy for tls [18:47:26] https://gerrit.wikimedia.org/r/c/operations/puppet/+/558660/ [19:06:49] 10Analytics, 10User-Elukey: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (10Ottomata) [19:09:23] 10Analytics, 10User-Elukey: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (10Ottomata) [19:11:11] 10Analytics, 10User-Elukey: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (10Ottomata) @Razzi, I only know how this works about 60%. You should probably thoroughly read through the puppet code starting in `profile::tlsproxy::envoy`. Trace it all the way... [19:26:06] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794 (10Som_Marcin) Hello Team, Can anyone please help me with how you guys have passed the java agent for hive metastore via hive-env.sh f... [19:28:05] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794 (10Ottomata) @Som_Marcin https://gerrit.wikimedia.org/r/c/operations/puppet/+/469585/2/hieradata/role/common/analytics_cluster/coor...
[19:44:14] ottomata: I'm not sure what the exact significance of a stream title is, but, I sent you https://gerrit.wikimedia.org/r/627353 [19:44:37] I'm guessing that we'd have to sync-file that and then rolling-restart the eventgate-logging-external pods? [19:45:05] I also have to confess that I'm not sure which part of this controls how this stuff is output to logstash :D [19:45:30] ah [19:45:34] cdanis: it needs to match the schema title [19:45:35] as in [19:45:35] (oh, I guess it just shows up as a `meta.stream` field in logstash?) [19:45:35] https://schema.wikimedia.org/repositories//primary/jsonschema/w3c/reportingapi/network_error/latest.yaml [19:45:38] title: w3c/reportingapi/network_error [19:45:45] oh right! [19:45:50] sorry, I was looking at the old version locally [19:45:56] the stream name can be anything [19:46:16] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794 (10Som_Marcin) @Ottomata Just wondering, can we pass it via hive-env.sh? I am using apache-hive-2.3.7 and tried the below one but it's not working... [19:46:21] it will become the kafka topic(s) name [19:46:34] and possibly also the hive table name, if this ever goes to hive [19:46:41] yeah, I think we will probably want it to [19:46:44] eventually [20:04:07] (03PS1) 10Gerrit maintenance bot: Add ru to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/627357 (https://phabricator.wikimedia.org/T262812) [20:05:56] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794 (10Ottomata) Ah I get your question now. Just looked, it seems we export this to the metastore via the HIVE_METASTORE_HADOOP_OPTS... [20:08:34] (03CR) 10Ladsgroup: "Stupid bot.
This should be abandoned :(" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/627357 (https://phabricator.wikimedia.org/T262812) (owner: 10Gerrit maintenance bot) [20:10:39] (03Abandoned) 10MarcoAurelio: Add ru to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/627357 (https://phabricator.wikimedia.org/T262812) (owner: 10Gerrit maintenance bot) [20:13:11] (03CR) 10Urbanecm: "thanks, but i finally figured out how to abandon it via ssh! 😄" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/627357 (https://phabricator.wikimedia.org/T262812) (owner: 10Gerrit maintenance bot) [20:22:10] cdanis: LGTM [20:22:16] you want to scap sync? [20:22:25] ottomata: thanks! and yeah, I will sync-file [20:22:28] great [20:23:44] cdanis: the logstash part is manual and is defined in profile::logstash::collector7 [20:28:22] ottomata: okay I'm not sure what to edit there [20:28:50] see the two logstash::input::kafka { 'clienterror [20:29:00] you'll need the same for your topics [20:29:17] the topic will be the stream name you put in the stream config [20:29:23] ahhh [20:29:25] prefixed by either eqiad. or codfw. [20:29:28] yes, plus eqiad or -- right [20:29:50] also, you'll need to make sure your client POSTs to the URL with ?stream= [20:30:00] and the schema_uri too [20:30:11] which will be [20:30:24] /w3c/reportingapi/network_error/1.0.0 [20:31:40] yepyep :) [20:31:44] was doing that in my manual testing [20:31:56] ottomata: where does the `type` field come from in that puppet config? [20:32:12] cdanis: no idea [20:32:17] i guess it's freeform? [20:32:29] # [*type*] [20:32:29] # Log type to be passed to Logstash. Default: none. [20:32:30] ? [20:33:00] maybe ask godog? [20:33:07] ahh I suspect it has to do with other logstash configurations [20:33:18] I'll ask shdubsh (and put him on reviewers as well) [20:33:40] aye [20:36:45] o/ [20:38:15] shdubsh: how can we help you?
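Pulling together the client-side requirements from the exchange at [20:29:50]–[20:30:24]: the producer POSTs to eventgate with the stream name in the `?stream=` query parameter, and the event body carries the schema URI in its `$schema` field. A minimal sketch follows; the hostname and stream name are assumptions for illustration, while the schema URI is the one quoted in the chat.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint and stream name (illustrative only; not confirmed in the chat).
BASE_URL = "https://intake-logging.wikimedia.org/v1/events"
STREAM = "w3c.reportingapi.network_error"

# The stream name goes in the query string...
url = f"{BASE_URL}?{urlencode({'stream': STREAM})}"

# ...and the schema URI (from [20:30:24]) goes in the event's $schema field.
event = {
    "$schema": "/w3c/reportingapi/network_error/1.0.0",
    "meta": {"stream": STREAM},
}
body = json.dumps(event)
print(url)
```

Per [20:29:25], the stream name then reappears on the consuming side as the Kafka topic, prefixed with the datacenter (`eqiad.` / `codfw.`).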
[20:38:44] I got a ping here [20:39:58] shdubsh: ottomata and I were wondering if https://gerrit.wikimedia.org/r/c/operations/puppet/+/627364 was even close to correct :) [20:43:17] cdanis: looks right to me [20:43:35] cool, thanks :) out of curiosity, what *is* the significance of the `type` field? [20:43:51] type is usually what filters are keyed off of when they're applied [20:45:07] https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html#plugins-inputs-kafka-type [20:45:20] >The type is stored as part of the event itself, so you can also use the type to search for it in Kibana. [20:45:23] ah nice, that's useful too [20:45:28] thanks! [20:45:36] np :) [20:45:39] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794 (10Som_Marcin) @Ottomata have you tried any other java option? I'm getting this error: ` Caused by: java.net.SocketException: Protocol... [20:46:55] one of you two want to give it a stamp? [20:50:05] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794 (10Ottomata) Can't say I know what that is about. Just in case, don't just copy paste what I got there, you probably only want the... [20:53:08] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794 (10Som_Marcin) yes i configured the right path, seems to have some java issue. [21:00:55] also shdubsh another quick question -- should I disable puppet on all the logstash collectors, run it on one, make sure it doesn't explode, and then allow it to run on the others? or is the config reload relatively safe nowadays? [21:03:24] It's through a define so it should be pretty safe, but it's better to slow roll since logstash is touchy.
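For context, a sketch of what the two extra Kafka inputs discussed above ([20:28:50]–[20:29:25]) could look like in profile::logstash::collector7. This is a hypothetical illustration only: the resource titles, parameter names, and cluster names are assumptions modeled loosely on the existing 'clienterror' inputs, not the actual WMF puppet code; only the DC-prefixed topic naming and the freeform `type` come from the chat.

```puppet
# Hypothetical sketch; parameter names and values are assumptions.
logstash::input::kafka { 'network_error-eqiad':
    kafka_cluster_name => 'logging-eqiad',                          # assumed cluster name
    topic              => 'eqiad.w3c.reportingapi.network_error',   # stream name prefixed with the DC
    type               => 'network-error',                          # freeform; filters key off this
}
logstash::input::kafka { 'network_error-codfw':
    kafka_cluster_name => 'logging-codfw',                          # assumed cluster name
    topic              => 'codfw.w3c.reportingapi.network_error',
    type               => 'network-error',
}
```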
[21:06:48] cdanis: i'm out for the day! excited to see this happening oh boy! [21:06:49] ttyt [21:08:40] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794 (10Som_Marcin) @Ottomata Is it possible you can ping me the file path? [22:57:19] 10Analytics, 10Operations, 10Patch-For-Review: Deploy an updated eventgate-logging-external with NEL patches - https://phabricator.wikimedia.org/T262087 (10CDanis) I believe the only thing left to do is to perform a rolling restart of the eventgate-logging-external pods (or the container within them). I'd l...
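On the T184794 thread above, ottomata notes ([20:05:56]) that the jmx agent is passed to the metastore via HIVE_METASTORE_HADOOP_OPTS. A hive-env.sh fragment along those lines might look like the following; the jar path, port, exporter config path, and the SERVICE guard are all assumptions, since the actual values live in the hieradata gerrit link (truncated in the log).

```bash
# Hypothetical hive-env.sh fragment. Everything below except the variable
# name HIVE_METASTORE_HADOOP_OPTS (named in the chat) is an assumed placeholder.
JMX_AGENT_JAR="/usr/share/java/prometheus/jmx_prometheus_javaagent.jar"  # assumed path
JMX_EXPORTER_CONFIG="/etc/prometheus/hive_metastore_jmx_exporter.yaml"   # assumed path

if [ "$SERVICE" = "metastore" ]; then
    # Attach the prometheus jmx exporter agent, listening on an assumed port.
    export HIVE_METASTORE_HADOOP_OPTS="${HIVE_METASTORE_HADOOP_OPTS} -javaagent:${JMX_AGENT_JAR}=9183:${JMX_EXPORTER_CONFIG}"
fi
```

The jmx exporter's `-javaagent:<jar>=<port>:<config.yaml>` argument form is the standard prometheus jmx_exporter convention; whether this exact wiring matches the WMF hieradata would need to be checked against the linked patch.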