[00:12:55] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (GoranSMilovanovic) @Ottomata @elukey Job completed. Please let me know if you still observe the logging problem. [02:50:06] Analytics, Event-Platform, MassMessage, WMF-JobQueue: The mass-message queue reports 0 when there are still queued messages - https://phabricator.wikimedia.org/T209899 (AntiCompositeNumber) Tools like MassMessage add jobs to the job queue for later processing outside of the main MediaWiki process... [03:07:56] Analytics, Analytics-EventLogging, Platform Team Initiatives (Abstract Schema): Convert EventLogging to AbstractSchema - https://phabricator.wikimedia.org/T268547 (Reedy) Open→Invalid Extension has no database tables [04:32:53] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:54:15] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:24:39] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (elukey) >>! In T268376#6642606, @Ottomata wrote: > Hm, @elukey may have given you the wrong file path for this test? `/etc/spark2/def... [07:35:55] Good morning [07:36:09] bonjour [07:52:40] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (JAllemandou) I confirm something has changed in the previous run: the logs are way less verbose (53k DEBUG lines and 1230k lines INFO)... [08:08:54] Analytics-Clusters, Patch-For-Review: Review an-coord1001's usage and failover plans - https://phabricator.wikimedia.org/T257412 (elukey) While we transition oozie jobs to analytics-hive.eqiad.wmnet, here's some info about other daemons: * hive metastore: https://docs.cloudera.com/documentation/enterpri...
[08:29:08] Analytics: Fix purging pageview_actor data - https://phabricator.wikimedia.org/T268382 (JAllemandou) [09:06:48] Analytics, Analytics-EventLogging, Event-Platform, Performance-Team, Patch-For-Review: Parse user agents in navtiming instead of relying on eventlogging to do it - https://phabricator.wikimedia.org/T260580 (Gilles) Open→Resolved [09:06:51] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, and 4 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (Gilles) [09:16:03] !log drop principals and keytabs for analytics10[42-57] - T267932 [09:16:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:16:06] T267932: decommission analytics10[42-57] - https://phabricator.wikimedia.org/T267932 [09:16:08] ah no double log [09:16:21] anyway, old workers' creds cleaned up [09:34:17] Analytics: Fix purging pageview_actor data - https://phabricator.wikimedia.org/T268382 (elukey) It seems that the timer runs correctly from the journal log, this is its config: ` ExecStart=/usr/local/bin/kerberos-run-command analytics /srv/deployment/analytics/refinery/bin/refinery-drop-older-than --databas... [09:34:28] joal: --^ [09:59:55] Analytics: Fix purging pageview_actor data - https://phabricator.wikimedia.org/T268382 (JAllemandou) > @JAllemandou the base path has an extra "hourly" that is not mentioned in your hdfs ls, could it be the cause? I'm pretty sure it is! Thanks for catching! [10:05:22] Analytics, Analytics-Kanban, Operations, netops, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (ayounsi) I think we can remove: Everything that we remove from T231339#6612105 (which means `net_cidr_src`, `net_cidr_dst` as... [10:09:04] so I found out that Luca from the past removed the hive token db setting because it was a problem on bigtop [10:09:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/576099/ [10:09:09] that is really weird [10:10:29] I just turned it on again manually on hadoop test, looks like it is working [10:23:45] Quarry, cloud-services-team (Kanban): Do some checks of how many Quarry queries will break in a multiinstance environment - https://phabricator.wikimedia.org/T267989 (dcaro) I introduced the detection of implicit db usage (as in not explicitly stating the db name), and this is the current list of found c... [10:35:32] elukey: hey, for when you have time: https://gerrit.wikimedia.org/r/c/operations/puppet/+/643104 [10:36:20] I'm going to bother you a lot this week and next, mwhahaaaaa [10:38:44] Amir1: hi! Yes I saw it, it is in my backlog, will do :) [10:39:09] Amir1: do you have some negative karma to balance that brings you to improve puppet coding? :D [10:39:33] Thanks! [10:39:48] elukey: Regarding negative karma, how do you feel about upgrading mailman? [10:39:52] :P [10:39:54] ahahahah [10:40:23] ok so you like the sre misery and pain, I get it now :) [10:40:40] jokes aside, really nice and helpful work, thanks! [10:40:47] the codebase is huge and the more help we get the better [10:42:13] I do :(((( [10:42:41] yeah, I'm quite happy to see it improve and get slightly more modern and robust [10:48:45] Amir1: left a comment but all looks nice! I am also wondering about deployment-prep, we should run puppet in there too as a test to verify that nothing breaks [10:48:54] (I don't expect anything to break but..)
[10:49:09] Amir1: I took the liberty of +2'ing change 643104 [10:49:25] I approve of its intent and am saddened by its necessity :) [10:49:48] Also, morning everyone :) [10:50:30] morning! [10:50:32] klausman: morning o/ [10:50:45] I will monitor deployment prep [10:51:51] elukey: I had a discussion with Andrew yesterday about my ATS/Varnish thing. We agreed that taking a sample of 1h or so of data and working with that to find the discrepancies is probably the better approach than always using live data. So I'll work on changing my code to facilitate that today. [10:52:09] Amir1: https://kafka.apache.org/documentation/#log.retention.bytes [10:52:15] Also, I may create a VarnishKafka topic that just has cp3050 data, to stem the tide a bit. [10:52:35] klausman: sure makes sense [10:54:12] Thanks. So both of them should be integers then [10:54:55] Let me do a PCC [10:55:45] Amir1: there is also another kafka cluster worth checking, the logging one [10:56:38] elukey: what is the host naming convention? [10:56:54] Amir1: I was trying to find an example, here it is: logstash1010.eqiad.wmnet [10:57:07] (very confusing I know, but kafka is colocated with logstash) [10:57:42] if you are curious, find all kafka occurrences in site.pp [10:58:11] after all I saw in mediawiki nothing surprises me anymore [10:59:01] exactly [10:59:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/643104 [11:00:02] PCC is happy in all cases [11:00:06] heya teamm :] [11:00:40] joal: have you seen T254332 last comments on Druid capacity? [11:00:40] T254332: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 [11:00:44] what do you think? [11:01:21] Hi mforns - I have seen the discussion - We need to talk with network people about their usages - For long term analysis they should use the cluster [11:02:01] mforns: Or at least it's my view - We won't be able to provide them with the amount of detail they want in Druid AFAICS [11:02:20] joal: I had a chat with a nice French guy this morning about it, I think that we'd need to set up a meeting in which we explain the pipeline [11:02:28] so it is clear for SRE folks interested [11:02:49] makes sense elukey - cc mforns [11:02:55] No problemo [11:03:11] elukey: looks like the technical-prez of analytics will be used again! [11:03:17] mforns, joal - if any of you is interested we can set up something! [11:03:34] joal: but I'm worried about the 90 days worth of data in druid, they are still big by themselves, 1.8TB so far, and if we add the new fields... that could be maybe 3TB+? [11:04:09] right mforns - It'll depend on the aggregation level etc [11:04:19] elukey: yes, cool [11:04:23] we should definitely talk before acting [11:04:54] you mean in a meeting? [11:05:44] they are open to reducing the Druid raw retention from 90 to 60 days if necessary [11:06:39] mforns, elukey: Shall we try to explain the pipeline first, then decide - or make the decision first and then explain? [11:07:35] dunno [11:08:10] I think that we can proceed and then explain, they will eventually follow our advice and restrictions etc.. [11:08:29] netflow in druid is needed for ddos/realtime/etc.. [11:08:39] so it is fine if we don't have a big history [11:09:01] the interesting part is to explain, in my opinion, how to do the traffic analysis part on hive/spark/presto [11:09:15] aha [11:11:04] Faidon also has, IIRC, a UI tool that fetches from druid and visualizes the BGP ASes that we exchange traffic with the most [11:11:10] to suggest peering options etc...
[11:16:32] elukey: given the data-size presto should be a no-go [11:16:43] spark in that case [11:18:33] joal: because it is small? [11:18:44] because it's HUGE :) [11:18:57] more than webrequest? [11:19:41] I mean if they are gentle in making queries it should be fine, also hopefully we'll have alluxio soon-ish :) [11:20:37] Amir1: all deployed, no-op as expected [11:22:15] elukey: it depends a lot on time-span [11:22:35] elukey: recent data is almost 400Gb per month [11:22:50] elukey: If analysis is needed for a few days, presto should do (as druid) [11:23:07] joal: sure we can of course make tests with them etc.., if unfeasible we switch to spark [11:23:16] if analysis is needed over a few months, presto is not [11:23:34] not even with alluxio? [11:24:13] anyway, whatever tool is best, spark sql is very nice as well [11:24:21] also I think that people will like pyspark [11:24:33] elukey: spark-sql with correct logging level set should not be too complicated :) [11:24:44] elukey: And notebooks with charts [11:25:02] elukey: We can help them set up examples that they can reuse [11:25:31] yes sure [11:25:54] elukey: Thanks <3 [11:26:14] joal, elukey, :] leaving now, will be back later! [11:26:56] ack [11:34:35] * elukey lunch! [12:42:10] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (GoranSMilovanovic) @JAllemandou @elukey Testing now (driver is stat1005) with @Ottomata's suggestion to use the `/etc/spark2/conf/lo... [12:59:44] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (JAllemandou) The above job is still providing DEBUG logs in driver - I have checked while the job was running. This is really weird :... [13:14:07] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (GoranSMilovanovic) @JAllemandou I do not know if it helps, but remembering what I've learned from the Pyspark docs - in my first attem...
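The T268376 thread above chases runaway DEBUG output from the Avro deserializer in pyspark jobs. Independent of the log4j.properties route being tested there, pyspark can also cap verbosity from inside the job itself; a minimal sketch, illustrative rather than the fix that was actually deployed:

    from pyspark.sql import SparkSession

    # Cap log verbosity from within the job: setLogLevel() overrides the
    # root-logger level that the log4j config set on the driver.
    spark = SparkSession.builder.appName("avro-log-test").getOrCreate()
    spark.sparkContext.setLogLevel("WARN")  # drops DEBUG/INFO chatter
    # ... run the Avro-reading workload here ...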
[13:14:21] Analytics, observability, Graphite: statsv seems to be down or broken - https://phabricator.wikimedia.org/T268624 (Gilles) a:Ottomata [13:15:43] Analytics, observability, Graphite: statsv seems to be down or broken - https://phabricator.wikimedia.org/T268624 (Gilles) Seemingly broken by https://gerrit.wikimedia.org/r/c/analytics/statsv/+/639223 and/or https://gerrit.wikimedia.org/r/c/operations/puppet/+/639216 [13:16:43] Analytics, Performance-Team, observability, Graphite: statsv seems to be down or broken - https://phabricator.wikimedia.org/T268624 (Gilles) a:Ottomata→Gilles [13:22:00] (PS1) Gilles: Convert message from bytes to str [analytics/statsv] - https://gerrit.wikimedia.org/r/643252 (https://phabricator.wikimedia.org/T268624) [13:23:26] (PS1) Gilles: Review access change [analytics/statsv] (refs/meta/config) - https://gerrit.wikimedia.org/r/643153 [13:24:06] (CR) Gilles: "So we don't depend on the Analytics team anymore for emergency fixes to statsv" [analytics/statsv] (refs/meta/config) - https://gerrit.wikimedia.org/r/643153 (owner: Gilles) [13:43:25] (PS1) Joal: Update pageview title extraction for trailing EOL [analytics/refinery/source] - https://gerrit.wikimedia.org/r/643255 [13:46:32] Analytics, Analytics-Kanban: Fix pageview title accepted values (trailing EOL) - https://phabricator.wikimedia.org/T268630 (JAllemandou) [13:46:50] Analytics, Analytics-Kanban: Fix pageview title accepted values (trailing EOL) - https://phabricator.wikimedia.org/T268630 (JAllemandou) a:JAllemandou [13:47:15] (PS2) Joal: Update pageview title extraction for trailing EOL [analytics/refinery/source] - https://gerrit.wikimedia.org/r/643255 [13:48:34] (CR) Elukey: [V: +2 C: +2] Convert message from bytes to str [analytics/statsv] - https://gerrit.wikimedia.org/r/643252 (https://phabricator.wikimedia.org/T268624) (owner: Gilles) [14:00:08] Analytics, Product-Analytics, Inuka-Team (Kanban): Set up preview counting for KaiOS app - https://phabricator.wikimedia.org/T244548 (SBisson) p:Triage→Medium a:SBisson [14:01:03] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (Ottomata) Hm, it sounds like the one time the logs were silenced was when you provided a log4j file path that did not exist, possibly... [14:03:20] Analytics, Performance-Team, observability, Graphite, Patch-For-Review: statsv seems to be down or broken - https://phabricator.wikimedia.org/T268624 (Gilles) Open→Resolved The fix appears to work, I see new data coming into Grafana. [14:08:35] (CR) Milimetric: [C: +2] Review access change [analytics/statsv] (refs/meta/config) - https://gerrit.wikimedia.org/r/643153 (owner: Gilles)
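For context on the statsv breakage above: the failure mode is the classic Python 3 bytes-versus-str mismatch when consuming from Kafka. A minimal sketch of the pattern behind the "Convert message from bytes to str" patch; the names here are illustrative, not the actual statsv code:

    def handle_message(raw):
        # Kafka clients return message payloads as bytes under Python 3;
        # decode before any str parsing such as split() or partition().
        message = raw.decode('utf-8') if isinstance(raw, bytes) else raw
        for metric in message.split():
            name, _, value = metric.partition(':')
            print(name, value)

    handle_message(b'frontend.navtiming:123|ms')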
[14:09:29] elukey: when you have time: https://gerrit.wikimedia.org/r/c/operations/puppet/+/641451 [14:10:38] thanks! [14:10:58] also, elukey, I think we need to turn on Erik's cron job again [14:11:04] as painful as it might be [14:11:32] I'll poke around a bit from the backup you took [14:11:50] (this is for pagecounts-ez, due to more problems with pageview complete) [14:13:20] milimetric: ran puppet on labstores, can you check the page? [14:14:07] Analytics, Analytics-Kanban, Patch-For-Review: pageviews complete have irregular lines - https://phabricator.wikimedia.org/T267575 (Milimetric) Quick update: we found more problems that we inherited from the other data sets, namely page titles with \n or \r in them, messing up the lines. We will try... [14:14:36] yep, looks good elukey, thx [14:14:38] (https://dumps.wikimedia.org/other/pagecounts-ez/) [14:22:27] Analytics, Analytics-Kanban: Fix pageview title accepted values (trailing EOL) - https://phabricator.wikimedia.org/T268630 (JAllemandou) [14:22:30] Analytics, Analytics-Kanban, Patch-For-Review: pageviews complete have irregular lines - https://phabricator.wikimedia.org/T267575 (JAllemandou) [14:24:36] Analytics, Analytics-Kanban, Patch-For-Review: pageviews complete have irregular lines - https://phabricator.wikimedia.org/T267575 (JAllemandou) I added a child task about fixing the trailing EOLs. I'll also create a UDF about validating an existing page-title so that we can filter wrong titles from...
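The trailing-EOL problem above (T268630/T267575) comes down to page titles that end in \r or \n, which corrupt one-record-per-line outputs such as pageview-complete. The real fix lives in refinery-source (Scala); the idea, sketched in Python:

    import re

    # Sketch of the cleanup: strip trailing CR/LF from a page title so it
    # cannot break line-oriented dump formats. Illustrative only; the
    # production implementation is a Scala UDF in refinery-source.
    def normalize_title(title):
        return re.sub(r'[\r\n]+$', '', title)

    assert normalize_title('Main_Page\n') == 'Main_Page'
    assert normalize_title('Main_Page') == 'Main_Page'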
[14:26:32] (CR) Ottomata: [V: +2] Review access change [analytics/statsv] (refs/meta/config) - https://gerrit.wikimedia.org/r/643153 (owner: Gilles) [14:36:23] Analytics-Clusters, Patch-For-Review: Review an-coord1001's usage and failover plans - https://phabricator.wikimedia.org/T257412 (elukey) It sounds like https://issues.apache.org/jira/browse/HIVE-17368 is the issue that we are facing, and the fix is: https://github.com/apache/hive/commit/623ecaa53bc7d55... [14:40:25] ottomata: o/ when you have a moment can you tell me if something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/641958 makes sense or not? I am trying to clean up a little our puppet codebase, and I am wondering if it is worth it or not [14:42:39] ah sorry elukey thought that was one i already +1ed [14:42:39] hm [14:43:12] it is an idea, I am using this week to try multiple things :) [14:43:51] the main driver is that if we use kerberos::exec we assume it is for kerberos, and adding "enable => true" is a little redundant for me [14:44:02] can you just default $use_kerberos to true in kerberos::exec? [14:44:11] then could we control it via a global hiera somehow? [14:44:40] i'm for getting rid of all the parameters in the classes we pass down; it really only needs to be a global switch somehow [14:45:25] but then we'd need to have a lookup somewhere that reads the global param [14:47:59] ottomata: the alternative could be that kerberos-run-command in labs doesn't really kinit [14:48:04] but just execute the command [14:48:29] it would work for both timers and execs [14:52:17] elukey: is that bad? why not in kerberos::exec? [14:52:44] $kerberos_enabled => lookup('kerberos_enabled', 'default' => true) [14:52:51] param in kerberos::exec ? [14:53:04] or [14:53:07] ottomata: not sure, it is a define that we use everywhere, lookup should be used only in profiles [14:53:14] oh right hm [14:54:06] i guess doing it in a place for both execs and timers is good [14:55:58] hmm elukey i like what you are trying to do and am all for it if you have strong opinions! you are the main operator so I see the convenience. it does kinda feel wrong somehow though [14:56:16] i guess; you are just making kerberos required for these puppet classes though [14:56:43] elukey: maybe if you added some notes or comments about that somewhere? that the cdh module requires that kerberos is enabled? [14:56:53] and configured? [14:58:06] ottomata: yep I can do it, I'll also see with Moritz if the kerberos-run-command thing could work, it might be easy to test stuff if needed in cloud/labs [14:58:23] thanks for the brainbounce, will let you know before pulling the trigger ok? [14:59:06] ok! or pull it without me that is fine too :) [14:59:14] ack :) [14:59:20] !log move analytics1072 from rack B2 to B3 [14:59:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:29:00] mforns: how do I test the streams UI with some real data, it won't connect to beta from my local, do I have to check it out somewhere on that network? [15:33:04] mforns: ok! eventstreams deployed in prod with openapi updates [15:33:16] https://stream.wikimedia.org/?doc#/streams [15:34:11] ottomata: you know how to test the streams ui? [15:34:21] (trying to code review) [15:34:51] milimetric: if you change eventStreamUri in src/config/index.js [15:34:59] you can run it against e.g. the prod instance [15:35:27] using npm run serve in the ui directory [15:35:51] the default is to run against the local repo using dist/ [15:36:04] you'd have to run eventstreams for that, which isn't hard, but then to test the actual streaming you'd have to have kafka etc. [15:36:14] so for testing you can just have it hit the prod instance [15:38:34] !log move druid1005 from rack B7 to B6 [15:38:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:40:06] ottomata: got it, I had set it to the beta eventStreamUri and that wasn't working from local (I'm not sure why), but prod worked [15:40:25] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:40:37] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:41:03] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:41:14] uff [15:41:43] this is not great, a host down shouldn't cause this mess [15:41:51] ouch [15:42:20] it is only on part of the aqs cluster, I think it will recover soon [15:42:55] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:43:01] I am wondering if it is the check or the API [15:43:24] namely, if the check needs a shorter timeout [15:43:31] or if it is AQS hanging on druid's conns too much [15:44:57] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:45:03] ah interesting wikistats doesn't work [15:46:01] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:46:17] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:46:21] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:46:24] Analytics, DBA, WMF-NDA: Remove 'kraken' user - https://phabricator.wikimedia.org/T268636 (Marostegui) Open→Resolved a:Marostegui Thanks, I have dropped it: ` # ./section s4 | while read host port; do echo "$host:$port"; mysql.py -h$host:$port -e "show grants for 'kraken'@'10.64.%';";done... [15:46:27] Analytics, DBA, WMF-NDA: Remove 'kraken' user - https://phabricator.wikimedia.org/T268636 (Marostegui) [15:46:35] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:47:09] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:47:25] milimetric: do you know how much time aqs waits for druid's response? [15:47:57] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:49:05] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:50:05] (looking) [15:53:48] elukey: should be druid.server.http.defaultQueryTimeout which defaults to 5 minutes [15:54:42] milimetric: nono we have set 5 seconds IIRC for druid to timeout, but I am wondering if aqs somehow hangs, or if it is the problem with brokers hanging on historicals [15:54:47] like when we drop datasources [15:55:18] 5 seconds seems way too fast... I feel like a lot of queries would time out in normal operation. The reason they look fast is 'cause they're cached [15:55:28] (CR) Fdans: [C: +1] "thank you for doing this joal, looks good to me" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/643255 (owner: Joal) [15:55:43] Analytics, Performance-Team, observability, Graphite: statsv seems to be down or broken - https://phabricator.wikimedia.org/T268624 (Lucas_Werkmeister_WMDE) Great, thank you! [15:57:33] milimetric: we have set it to 5s since the problem is that brokers stop working waiting for historicals, and longer timeouts mean that a queue builds up more quickly on top of the historicals [16:05:03] milimetric: also druid1005 is booting now, so it auto-recovers, I think it is the same issue of the brokers -> historicals queue
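On the 5-second timeout discussed above: Druid accepts a per-query timeout (in milliseconds) in the query context, which is what keeps one slow historical from building a queue on the broker. A hedged example of such a query; the broker host, datasource, and interval below are placeholders, not real cluster values:

    import requests

    query = {
        "queryType": "timeseries",
        "dataSource": "example_datasource",   # placeholder name
        "intervals": ["2020-11-23/2020-11-24"],
        "granularity": "hour",
        "aggregations": [{"type": "count", "name": "rows"}],
        "context": {"timeout": 5000},  # per-query timeout in ms
    }
    # The broker normally listens on /druid/v2/; host is illustrative.
    resp = requests.post("http://druid-broker.example:8082/druid/v2/",
                         json=query, timeout=10)
    print(resp.json())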
[16:05:17] heya ottomata, looking at the eventstreams ui... does the node-rdkafka lib version need to be version 2.3.4? I have a weird error with that version and my node version. It disappears when I replace 2.3.4 with ^2.3.4 in the package.json [16:05:33] is it ok to change? [16:05:35] nope that sounds fine [16:05:39] ok [16:30:28] mforns: o/ - is https://phabricator.wikimedia.org/T257692 for this week's train? [16:30:56] or already done etc..? (I see it in ready to deploy) [16:33:48] Analytics-Radar, Analytics-Wikimetrics, Diffusion-Repository-Administrators, Projects-Cleanup, Wikimedia-GitHub: Archive analytics-wikimetrics (deprecated by Event Metrics) - https://phabricator.wikimedia.org/T219334 (Milimetric) oops, this slipped by our radar. Any repositories here can be... [16:52:19] Analytics, Analytics-Wikistats, Product-Analytics: Contribution inequality graphs for Wikistats - https://phabricator.wikimedia.org/T195033 (Jan_Dittrich) I calculated the [[ https://en.wikipedia.org/wiki/Hoover_index | hoover index ]] for some contribution metrics recently and it was rather easy to... [17:38:30] Hi all, is it a known problem that selecting page_id from wmf.webrequest with Spark fails? "select page_id from wmf.webrequest ..." --> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted [17:49:55] Hi tizianop - It's a known problem I can explain [17:49:56] hm, no that sounds very strange tizianop ! [17:49:57] oh [17:50:00] never mind listen to joal :) [17:50:11] tizianop: in meeting now, will talk after please :) [17:50:31] ok, thanks! [18:06:02] I am reading https://wikitech.wikimedia.org/wiki/Analytics/Systems/Wikistats_2#Deployment [18:06:11] do you people use docker to run npm? [18:06:38] in theory using the same base image as thorium, namely debian stretch [18:06:48] but I don't see any reference for it in the docs [18:07:34] (I do it for superset and turnilo for example) [18:07:46] fdans, milimetric --^ [18:08:26] the other thing is if there is a version of npm that we all use [18:08:32] elukey: hmmm we don't, we just use our dev environments [18:09:06] elukey https://www.irccloud.com/pastebin/zDxH0eiE/ [18:12:16] fdans: ok so we target a specific combination of node/npm, should we state it explicitly in the docs? [18:14:06] ah more an npm version, we don't really care about nodejs on thorium right? [18:14:45] for example, npm on buster is 5.8 https://packages.debian.org/buster/npm [18:15:01] on stretch it is not there, it needs a third party [18:15:30] ottomata: Would you have time for Gobblin talk now? [18:17:53] (CR) Joal: [C: -1] Add dimensions to editors_daily dataset (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/607361 (https://phabricator.wikimedia.org/T256050) (owner: Conniecc1) [18:20:57] joal: just started making some lunch, can we do in 20 mins? if not now is fine! [18:21:20] ok for me ottomata :) [18:21:24] ok! [18:21:44] Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: Add dimensions to editors_daily dataset - https://phabricator.wikimedia.org/T256050 (JAllemandou) Ok I think we nailed it @cchen :) I'm commenting on the CR and adding the result of our test below. ` // Query with AND clause filte...
[18:21:53] tizianop: Here I am [18:23:07] tizianop: earlier this month we deployed a change to the webrequest table that should have impacted only the bucketization of the table (useful when sampling) [18:25:29] tizianop: Unfortunately this deploy also brought in a change supposedly long done (change of type for page_id, from int to long) that had actually never been deployed [18:26:07] tizianop: this leads to errors when reading the page_id field for data older than the data newly computed after the deploy [18:26:49] tizianop: There are ways to trick the system and access the data - let me know if you need them [18:28:05] Analytics-Clusters, Patch-For-Review: Review an-coord1001's usage and failover plans - https://phabricator.wikimedia.org/T257412 (elukey) The correct patch for hive 2.x should be https://github.com/apache/hive/commit/b3a6e524a6cb17893cbbeb26a877b0196ba16e21.diff [18:28:23] joal: so the patch for hive (the 2.x branch) should be https://github.com/apache/hive/commit/b3a6e524a6cb17893cbbeb26a877b0196ba16e21.diff [18:28:28] I am rebuilding the packages now [18:28:36] BIGTOP IS SO AWESOME [18:28:53] * joal is so happy to feel elukey happy :D [18:29:20] if it works and bigtop 1.5 doesn't have it I'll propose a patch [18:32:44] joal: Yup, that's the problem. I tried to query more recent data and it works (before I was using Nov 1st). I think for now I can simply work with the last week! Thank you very much :) [18:33:02] Thank you for your understanding tizianop :)
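One plausible reading of the "ways to trick the system" joal mentions above (an assumption, not necessarily his exact workaround): bypass the Hive table schema, which now declares page_id as bigint, by reading the Parquet files directly so Spark picks up the on-disk int type, then cast. The HDFS path and partition values below are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("webrequest-workaround").getOrCreate()

    # Reading the partition directly sidesteps the Hive metastore schema,
    # so the older int-typed files load without the conversion error.
    df = spark.read.parquet(
        "/wmf/data/wmf/webrequest/webrequest_source=text/year=2020/month=11/day=1"
    )
    df.select(col("page_id").cast("bigint")).show(5)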
[18:33:49] (PS1) Elukey: Release 2.8.3 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/643329 [18:41:18] a-team: I am going to deploy refinery, afaics there is only a change from yours truly, so no need for source [18:41:27] k! [18:41:27] please tell me otherwise, I'll wait some minutes [18:47:07] !log deploy analytics refinery as part of the regular weekly train [18:47:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:49:51] elukey: do you know sth about changes in navtiming? [18:50:44] mforns: mmm do you mean if anything changed recently ? [18:50:46] it seems the volume of events for that data set dropped in the last hours to less than half [18:51:01] ah lovely [18:51:24] mforns: that is an eventlogging stream right? [18:51:51] elukey: yes [18:51:59] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-24h&to=now&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=eventlogging_NavigationTiming [18:52:04] I'd say that the drop corresponds to desktop traffic [18:53:14] there is also https://grafana.wikimedia.org/d/000000505/eventlogging?orgId=1&viewPanel=13&from=now-24h&to=now-5m [18:54:00] mmmm the last jump is around 13:00, that corresponds more or less with the drop in navtiming.. is it possible mforns ? [18:54:43] (CR) Fdans: "elukey: this is missing the built dist directory right?" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/643329 (owner: Elukey) [18:54:52] yes [18:56:17] (CR) Fdans: [V: +2 C: +2] Pageview complete - Print explicit null values when there's no page id [analytics/refinery] - https://gerrit.wikimedia.org/r/642079 (https://phabricator.wikimedia.org/T267575) (owner: Fdans) [18:57:14] (CR) Elukey: "> Patch Set 1:" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/643329 (owner: Elukey) [18:57:55] mforns: I am tailing the error topic and I see a lot of navtiming events [18:58:09] "message": "Additional properties are not allowed ('cacheResponseType' was unexpected)" [18:58:19] aha [18:58:58] https://meta.wikimedia.org/w/index.php?title=Schema:NavigationTiming&action=history [18:59:01] mmmm [18:59:14] seems the events for mobile devices are ok, but lots of desktop events are missing [18:59:15] the last change was in october [18:59:21] aha [19:00:08] elukey: does the schema version in the error events correspond with the newest schema? [19:00:35] I was about to check [19:00:53] are you kafkacatting them? [19:01:21] yep! [19:01:28] will try too [19:01:29] kafkacat -t eventlogging_EventError -C -b kafka-jumbo1001.eqiad.wmnet [19:01:53] "revision": 20373802 [19:02:08] that seems the last one [19:02:24] ah no wait, 20373802 is not the last [19:02:49] gilles: around? :) [19:05:03] ottomata: any chance soon? [19:05:03] !log deploy refinery to hdfs (even if not really needed) [19:05:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:06:05] joal: yes now perfect [19:06:06] sorry [19:06:12] To the cave? [19:06:14] !n [19:06:16] yes! [19:08:10] mforns: do you want to open a task? [19:16:19] elukey: schema version issue on navigationtiming? [19:17:40] gilles: yeah it seems so, mforns was wondering why navtiming's traffic was dropping [19:17:50] and it seems lining up with eventlogging's error topic [19:18:00] the patch that introduced this change bumped the version in extension.json to 20521683 [19:18:11] which does have the cacheResponseType field [19:18:32] and I see events with that field but 20373802 as revision [19:18:35] https://gerrit.wikimedia.org/r/q/I1645631724f6f803a0bf145f2b52d2196cae71c3 [19:18:49] you can tail from any stat host [19:18:50] kafkacat -t eventlogging_EventError -C -b kafka-jumbo1001.eqiad.wmnet [19:19:43] I understand, what I mean is that the patch was consistent in that respect [19:19:59] yep yep [19:21:01] which mediawiki version is affected? [19:21:09] gilles: so traffic volume for nav timing is really weird [19:21:11] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-2d&to=now&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=eventlogging_NavigationTiming [19:21:29] elukey: other variations are possibly affected by oversampling [19:21:45] gilles: and see https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-2d&to=now&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=eventlogging_EventError [19:21:47] but if you check in Hive with event.isoversampling=false, you can see the differences [19:21:57] for today's drop [19:22:55] the contents of /srv/mediawiki/php-1.36.0-wmf.18/extensions/NavigationTiming/extension.json are correct [19:23:02] it's pointing to 20521683 [19:23:19] maybe it's worth re-deploying that file to mediawiki hosts?
[19:25:17] gilles: so the drop in the graph matches with today at 13:00 UTC more or less [19:25:21] and there was a deploy https://sal.toolforge.org/log/ZLpY-nUBgTbpqNOmb0r4 [19:26:08] (the drop in navtiming events I mean) [19:28:17] yeah, I suspect that extension.json didn't sync to all the mediawiki hosts or something like that [19:28:55] (PS2) Elukey: Release 2.8.3 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/643329 [19:29:13] gilles: I can grep with cumin and see [19:29:14] I'll try syncing that file right now, can't hurt [19:29:17] ah okok [19:29:26] (CR) jerkins-bot: [V: -1] Release 2.8.3 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/643329 (owner: Elukey) [19:30:37] done... [19:33:50] !log kill and restart webrequest_load bundle to pick up analytics-hive.eqiad.wmnet settings [19:33:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:35:03] (PS3) Joal: Update pageview title extraction for trailing EOL [analytics/refinery/source] - https://gerrit.wikimedia.org/r/643255 [19:42:15] (CR) Elukey: "What I did was:" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/643329 (owner: Elukey) [19:42:38] joal: ok webrequest_load is now set for analytics-hive, let's see :) [19:42:54] ok elukey :) [19:43:24] I am going to eat something and then I'll check again :) [19:43:28] * elukey afk for dinner! [19:44:42] Gone for now - will come back later as well [19:47:52] gilles: elukey mforns [19:48:01] the schema version has an override in mediawiki config [19:48:07] ah [19:48:25] oh i think https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/639600/1/wmf-config/InitialiseSettings.php [19:48:25] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/639600/1/wmf-config/InitialiseSettings.php [19:48:29] hasn't been merged [19:48:40] was waiting to do a few together because i didn't want to break things, but it looks like I did! [19:48:43] will merge that now [19:51:49] (PS2) Joal: [ONE-OFF] Add job to fix pageview-complete [analytics/refinery/source] - https://gerrit.wikimedia.org/r/642109 [19:52:08] hah, I totally forgot about that... :-| [19:54:30] elukey: synced change, i'd expect errors to go away eventually, after clients expire their JS caches...however long that takes [20:01:36] ottomata: ok thanks! [20:02:23] webrequest_load didn't start yet, will check a little later! [20:07:24] Analytics, Analytics-Kanban, Operations, netops, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (mforns) OK, cool. Knowing that you'd be open to reducing the retention period for Druid storage if necessary, what I'll do i...
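The kafkacat one-liner used in the debugging above is the quickest triage; the same check can be scripted, e.g. tallying validation errors per schema. This is a hedged sketch: it assumes the kafka-python client is available, and the EventError field layout is written from memory and should be verified against real events:

    import json
    from collections import Counter
    from kafka import KafkaConsumer  # assumes kafka-python is installed

    consumer = KafkaConsumer(
        'eventlogging_EventError',
        bootstrap_servers='kafka-jumbo1001.eqiad.wmnet:9092',
        auto_offset_reset='latest',
        consumer_timeout_ms=30000,  # stop after 30s without messages
    )
    counts = Counter()
    for msg in consumer:
        err = json.loads(msg.value)
        # 'event.schema' is an assumption; check an actual EventError payload.
        counts[err.get('event', {}).get('schema', 'unknown')] += 1
    print(counts.most_common(10))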
[21:19:00] Analytics-Clusters, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Downscale Pyspark scripts spark2-submit configs to manageable proportions - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) [21:19:43] Analytics-Clusters, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Downscale Pyspark scripts spark2-submit configs to manageable proportions - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) p:Triage→High [21:35:20] milimetric: I think the selected-streams-in-url is not possible with the current controls [21:35:58] the select doesn't have a way to preload selected streams (extracted from url) [21:36:43] so if someone uses a link with say #?streams=wikimedia.page-create there's no way to initialize the select with that information [21:40:48] I'd have to split the select into 2 components: a simple select plus a selected streams list [21:41:22] actually I started with that, but switched to el-select, given that it implemented all I wanted [21:41:49] what do you think, worth the change? [21:47:22] Analytics: Avro Deserializer logging set to DEBUG in pyspark lead to huge yarn stderr container files (causing disk usage alerts) - https://phabricator.wikimedia.org/T268376 (GoranSMilovanovic) @Ottomata If you are asking me to `... make a custom log4j.properties file as described in` in T268376#6645009 - I... [21:58:43] Analytics-Clusters, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Downscale Pyspark scripts spark2-submit configs to manageable proportions - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) - Implemented in: WDCM main modules, driver: stat1005. [22:00:08] Analytics-Clusters, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Downscale Pyspark scripts spark2-submit configs to manageable proportions - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) - Implemented in: WDCM Biases Pyspark module, driver: stat1007. [22:00:38] Analytics-Clusters, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Downscale Pyspark scripts spark2-submit configs to manageable proportions - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) - Implemented in: Wikidata Human vs Bot edits, driver: stat1005. [22:02:12] Analytics-Clusters, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Downscale Pyspark scripts spark2-submit configs to manageable proportions - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) - Fixed WDCM Pyspark deployment <= 2Tb of cluster memory usage (max. dynamic allocation = 1... [22:04:40] Analytics-Clusters, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Downscale Pyspark scripts spark2-submit configs to manageable proportions - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) - Changes implemented in Wikidata External Identifiers Landscape system, driver: stat1005. [22:07:50] Analytics-Clusters, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Downscale Pyspark scripts spark2-submit configs to manageable proportions - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) - Changes implemented in the Wikidata Languages Landscape system, driver: stat1008 (note: n... [22:11:54] Analytics-Clusters, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Downscale Pyspark scripts spark2-submit configs to manageable proportions - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) - Implemented: Wikidata Pageviews per Type system, driver: stat1007.
[22:16:38] Analytics-Clusters, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Downscale Pyspark scripts spark2-submit configs to manageable proportions - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) - Implemented: Wikidata Usage and Coverage system, driver: stat1004. @JAllemandou I would... [22:16:47] Analytics-Clusters, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Downscale Pyspark scripts spark2-submit configs to manageable proportions - https://phabricator.wikimedia.org/T268684 (GoranSMilovanovic) p:High→Low [22:17:34] joal: https://phabricator.wikimedia.org/T268684 implements all constraints on my Pyspark scripts (as described in your recent e-mail). Please review the ticket when you find some time and let me know if I can resolve it. Thanks.
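For reference on the downscaling theme of T268684: the knobs involved are standard Spark settings for capping what a single job can claim, but the values below are made-up examples, not the actual WDCM configuration from the task:

    from pyspark.sql import SparkSession

    # Illustrative downscaled pyspark setup in the spirit of T268684;
    # the numbers are assumptions, not the real WDCM configs.
    spark = (
        SparkSession.builder
        .appName("wdcm-example")
        .config("spark.dynamicAllocation.enabled", "true")
        # Cap dynamic allocation so one job cannot claim the whole cluster:
        .config("spark.dynamicAllocation.maxExecutors", "64")
        .config("spark.executor.memory", "8g")
        .config("spark.executor.cores", "4")
        .getOrCreate()
    )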