[00:01:59] Analytics, Product-Analytics, Growth-Team (Current Sprint): Remove the HelpPanel schema from the EventLogging whitelist - https://phabricator.wikimedia.org/T234855 (nettrom_WMF) Open→Declined This makes no sense as we'd probably just ask to whitelist it again in a month's time. Declining.
[00:11:08] Analytics, Product-Analytics, Growth-Team (Current Sprint): Help panel: delete sanitized data from before Oct 1 - https://phabricator.wikimedia.org/T234870 (nettrom_WMF)
[00:12:31] Analytics, Product-Analytics, Growth-Team (Current Sprint): Help panel: delete sanitized data from before Oct 1 - https://phabricator.wikimedia.org/T234870 (nettrom_WMF)
[00:16:02] Analytics, Product-Analytics, Growth-Team (Current Sprint): Help panel: delete sanitized data from before Oct 1 - https://phabricator.wikimedia.org/T234870 (nettrom_WMF) The second part of this, me deleting the initial set of data has been done: ` hive (nettrom_growth)> DROP TABLE helppanel_0410; OK...
[03:31:12] Analytics, Analytics-EventLogging, Operations, decommission, ops-eqiad: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (Papaul) ` papaul@asw2-c-eqiad# show | compare [edit interfaces] - ge-7/0/4 { - description dbproxy1009; - }
[03:31:50] Analytics, Analytics-EventLogging, Operations, decommission, ops-eqiad: Decommission dbproxy1004 and dbproxy1009 - https://phabricator.wikimedia.org/T228768 (Papaul)
[03:40:05] Analytics, DC-Ops, Operations, decommission, ops-eqiad: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (Papaul) No switch port reference for kafka1014 and kafka1022 on asw2-c-eqiad or asw-c-eqaid
[03:41:24] Analytics, DC-Ops, Operations, decommission, ops-eqiad: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (Papaul)
[05:00:33] Analytics, Analytics-Cluster, DC-Ops, Operations, ops-eqiad: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted - https://phabricator.wikimedia.org/T232069 (elukey) Open→Resolved >>! In T232069#5553714, @wiki_willy wrote: > Thanks @elukey . Should we ignore/reso...
[05:02:39] Analytics, Analytics-Kanban, Patch-For-Review, Performance-Team (Radar), User-Elukey: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941 (elukey) All good after 24 hours, it seems that 1.4.7 is the good one!
[05:02:47] Analytics, Analytics-Kanban, Patch-For-Review, Performance-Team (Radar), User-Elukey: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941 (elukey)
[05:39:19] Analytics, Analytics-EventLogging, Analytics-Kanban: drop CitatitionUsage data on mysql - https://phabricator.wikimedia.org/T233893 (elukey) Done! Checked with `du -hsc /srv/sqldata/_log_CitationUsage_*` on both hosts, no data anymore stored.
[05:39:35] Analytics, Analytics-EventLogging, Analytics-Kanban: drop CitatitionUsage data on mysql - https://phabricator.wikimedia.org/T233893 (elukey) p:High→Normal a:Ottomata→elukey
[05:44:23] Analytics, Analytics-EventLogging, Analytics-Kanban: Drop page create event data on mysql - https://phabricator.wikimedia.org/T233892 (elukey) p:High→Normal a:Ottomata→elukey
[05:54:49] Analytics, Analytics-EventLogging, Analytics-Kanban, Performance-Team (Radar): Drop Navigationtiming data entirely from mysql storage? - https://phabricator.wikimedia.org/T233891 (elukey) Data from 2018 in various tables: ` elukey@db1108:~$ for table in `cat el_nav_tables`; do echo $table; sudo...
[06:07:05] Analytics, Analytics-EventLogging, Analytics-Kanban, Performance-Team (Radar): Drop Navigationtiming data entirely from mysql storage? - https://phabricator.wikimedia.org/T233891 (elukey) To recap: * Drop table ` NavigationTiming_17703215 NavigationTiming_17729222 NavigationTiming_17740037 Navi...
[06:09:35] Analytics, Analytics-EventLogging, Analytics-Kanban, Performance-Team (Radar): Drop Navigationtiming data entirely from mysql storage? - https://phabricator.wikimedia.org/T233891 (elukey) Correction: among the tables to drop I can see some of them containing data from 2018: ` NavigationTiming_17...
[06:23:32] RECOVERY - Check the last execution of reportupdater-published_cx2_translations on stat1006 is OK: OK: Status of the systemd unit reportupdater-published_cx2_translations https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:25:56] this is not true --^
[06:25:58] will alarm again
[07:07:10] PROBLEM - Check the last execution of reportupdater-published_cx2_translations on stat1006 is CRITICAL: CRITICAL: Status of the systemd unit reportupdater-published_cx2_translations https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:13:48] yep...
[08:18:30] Analytics, Analytics-EventLogging, Analytics-Kanban, Performance-Team (Radar): Drop Navigationtiming data entirely from mysql storage? - https://phabricator.wikimedia.org/T233891 (Gilles) Yes, data from 2018 is safe to drop since we have it in Hadoop already. This all seems correct to me.
[08:19:24] Analytics, Analytics-EventLogging, Analytics-Kanban, Performance-Team (Radar): Drop Navigationtiming data entirely from mysql storage? - https://phabricator.wikimedia.org/T233891 (elukey) >>! In T233891#5554608, @Gilles wrote: > Yes, data from 2018 is safe to drop since we have it in Hadoop alrea...
[08:21:50] Analytics, Analytics-EventLogging, Analytics-Kanban, Performance-Team (Radar): Drop Navigationtiming data entirely from mysql storage? - https://phabricator.wikimedia.org/T233891 (elukey) Triple checked 2017 data on the tables listed by Gilles: ` elukey@db1108:~$ for table in `cat el_nav_tables_...
[08:22:52] Analytics, Analytics-EventLogging, Analytics-Kanban, Performance-Team (Radar): Drop Navigationtiming data entirely from mysql storage? - https://phabricator.wikimedia.org/T233891 (elukey) >>! In T233891#5554355, @elukey wrote: > To recap: > > * Drop table > > ` > NavigationTiming_17703215 > Nav...
[08:25:27] Analytics, Analytics-EventLogging, Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decomisioning - https://phabricator.wikimedia.org/T231858 (elukey) Given T234826, the next steps for this task in my opinion are: 1) stop producing events from eventlogging to db1107 2...
[08:35:53] I am roll restarting historicals and brokers on the druid analytics cluster
[08:36:00] with a relaxed timeout setting
[08:36:30] Loading segment[6417/10492]
[08:36:35] 10k segments!
[09:22:08] !log delete druid old test datasource from the analytics cluster - test_kafka_event_centralnoticeimpression
[09:22:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:26:38] ok 6k segments were related to --^
[09:26:50] for each druid host :P
[09:29:55] Analytics, Analytics-Kanban: Superset not able to load a reading dashboard - https://phabricator.wikimedia.org/T234684 (elukey) ok so now we have a more reasonable timeout for the druid analytics cluster (10s) but the issue persists, so we can rule out timeouts.
[09:46:43] errand and early lunch!
[12:25:03] Analytics, Analytics-Kanban, User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (elukey) I have stopped the kdc on krb1001 to simulate a host down scenario. I am able to renew my krb ticket but I want to leave it down for hours to see what happe...
[12:55:00] PROBLEM - Check status of defined EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: eventlogging-consumer@mysql-m4-master-00 https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging
[12:57:51] downtime expired!
[12:58:05] they are working on db1107's rack
[12:58:13] so I stopped the mysql consumer
[12:59:50] RECOVERY - Check status of defined EventLogging jobs on eventlog1002 is OK: OK: All defined EventLogging jobs are running. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging
[13:06:15] Analytics, Research: Taxonomy of new user reading patterns - https://phabricator.wikimedia.org/T234188 (Ottomata) cc @Milimetric and @mforns too in case they have query optimization tips for Martin :)
[13:07:23] o/
[13:10:08] o/
[13:12:03] Analytics, DBA: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Ottomata) > bacula backups for the analytics databases and the snapshot for the log database should be enough for this use case Q, will the bacula backups also include the `log` database? Might...
[13:13:21] ottomata: is it ok to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/541324/ ?
[13:13:40] didn't know the status of the migration
[13:13:51] but the ru jobs are not working
[13:14:04] (the ones migrated to hive)
[13:14:41] ah for sure!
[13:14:49] still checking email hadn't gotten there yet :)
[13:14:57] elukey: i can merge
[13:15:27] ... why didn't i see this email yesterday...?
[13:15:53] I wondered as well, didn't know if it was on purpose so I waited
[13:16:03] I'll merge, continue reading :)
[13:16:14] Analytics, DBA: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Marostegui) >>! In T234826#5555449, @Ottomata wrote: >> bacula backups for the analytics databases and the snapshot for the log database should be enough for this use case > Q, will the bacula ba...
[13:16:20] ah done!
[13:16:22] good
[13:17:50] Analytics, DBA: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Ottomata) Oh, I think it was a Q for Luca about how we intended to set that up. I assume we can do it either way. We wouldn't //have// to back up `log` to Bacula if it is too large since we'd h...
[13:19:36] Analytics, DBA: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (elukey) If the log db can be stored in Bacula it would be great! Otherwise HDFS is fine in my opinion..
[13:47:45] does hive2druid assume an-coord1001 if not specified?
[13:49:00] sorry meant druid1001
[13:49:01] seems so
[13:50:16] ah I can pass it via puppet
[14:04:37] afk for a bit!
[14:06:15] RECOVERY - Check the last execution of reportupdater-published_cx2_translations on stat1007 is OK: OK: Status of the systemd unit reportupdater-published_cx2_translations https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:10:11] RECOVERY - Check the last execution of reportupdater-published_cx2_translations on stat1006 is OK: OK: Status of the systemd unit reportupdater-published_cx2_translations https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:13:12] hey teamm
[14:20:40] Analytics, Analytics-EventLogging, Better Use Of Data, EventBus, and 4 others: Modern Event Platform: Stream Configuration: Implementation - https://phabricator.wikimedia.org/T233634 (Ottomata) OK I think I got something working. How's this? ConfigExports extension will be configured using two...
[14:21:16] mforns: heyaaa
[14:21:29] hey ottomata, thanks for merging the RU fix
[14:21:36] sorry I forgot to tell you what was it about
[14:21:38] yup, dunno why I didn't see that yesterday!
[14:25:32] Analytics, Analytics-EventLogging, Better Use Of Data, EventBus, and 4 others: Modern Event Platform: Stream Configuration: Implementation - https://phabricator.wikimedia.org/T233634 (jlinehan) >>! In T233634#5548703, @Krinkle wrote: > @Ottomata I'll elaborate next week, but the 2K is not the EL...
[14:52:51] super interesting
[14:53:13] I discovered today that hive2druid was not set up in the test cluster
[14:53:16] so I added it
[14:53:20] and now the druid indexation fails
[14:53:22] main : run as user is druid
[14:53:23] main : requested yarn user is druid
[14:53:23] User druid not found
[14:53:46] so this is yarn running containers on workers as user 'druid'
[14:53:50] that is not there
[14:55:55] any user that now runs in hadoop as 'yarn' will need to be added to the workers
[14:57:14] hm
[14:58:53] that makes sense, yarn will run containers as the user who started them
[14:59:09] not as 'yarn' or 'hdfs'
[15:00:06] like 'analytics'
[15:00:14] we have it on worker nodes
[15:27:20] (PS1) Mforns: Add spark job to generate a data quality report [analytics/refinery/source] - https://gerrit.wikimedia.org/r/541557 (https://phabricator.wikimedia.org/T215863)
[15:27:46] elukey: aye, w/o kerberos they are all just run as yarn, eh?
[15:30:44] (CR) Mforns: [C: -2] "Still testing." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/541557 (https://phabricator.wikimedia.org/T215863) (owner: Mforns)
[15:30:50] (CR) jerkins-bot: [V: -1] Add spark job to generate a data quality report [analytics/refinery/source] - https://gerrit.wikimedia.org/r/541557 (https://phabricator.wikimedia.org/T215863) (owner: Mforns)
[15:50:53] ottomata: yes exactly
[16:02:13] ping ottomata standdduppp
[16:02:33] oh coming
[16:19:43] Analytics, Analytics-Cluster: 500k files in hdfs /tmp - https://phabricator.wikimedia.org/T234954 (EBernhardson) Poking through the list suggests this is mostly old stuff, only ~1k files are dated 2019.
[17:55:40] elukey: still around?
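Editor's note on the 14:53-15:50 exchange: the "User druid not found" failure follows from YARN launching containers as the submitting user once Kerberos is enabled, so that user has to exist on each worker. A minimal sketch of a pre-flight check, assuming it is run locally on a worker node; the function name `missing_users` and the user list are hypothetical and not part of the team's tooling (the log only says users need to be added to the workers).

```python
import pwd
from typing import Callable, Iterable, List


def missing_users(
    required: Iterable[str],
    lookup: Callable[[str], object] = pwd.getpwnam,
) -> List[str]:
    """Return the names in `required` that the local account database lacks.

    `lookup` defaults to pwd.getpwnam, which raises KeyError for an
    unknown user; it is injectable so the check can be exercised
    without touching the real passwd database.
    """
    missing = []
    for name in required:
        try:
            lookup(name)
        except KeyError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical list: the two job users named in the chat.
    print(missing_users(["druid", "analytics"]))
```

Any name the function returns would need an account on that worker before jobs submitted as that user can start containers there.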
[17:58:23] Analytics, Product-Analytics, Growth-Team (Current Sprint): Help panel: delete sanitized data from before Oct 1 - https://phabricator.wikimedia.org/T234870 (nettrom_WMF) p:Triage→High
[18:10:16] Analytics, Discovery, EventBus, Wikidata, and 3 others: Log Wikidata Query Service queries to the event gate infrastructure - https://phabricator.wikimedia.org/T101013 (Ottomata)
[18:22:04] Analytics, Multimedia, Tool-Pageviews: Add ability to the pageview tool in labs to get mediarequests per file similar to existing functionality to get pageviews per page title - https://phabricator.wikimedia.org/T234590 (MusikAnimal) Thanks. I'm looking forward to it!
[18:31:28] Analytics, Multimedia, Tool-Pageviews: Add ability to the pageview tool in labs to get mediarequests per file similar to existing functionality to get pageviews per page title - https://phabricator.wikimedia.org/T234590 (Nuria) Almost there but not quite.
[18:36:47] nuria: was your response to Leon bc we're still in vetting mode? Because the metric is ready and backfilled
[18:36:47] https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwiktionary%2Fte%2F4%2F40%2Fpeacocks.JPG/daily/2019080100/2019090300
[18:37:31] fdans: yes, let's have him integrate once we have done our vetting
[18:49:41] Analytics, Event-Platform, Growth-Team, MediaWiki-Watchlist, and 3 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (WDoranWMF) Open→Resolved
[19:36:35] milimetric: you around?
[19:37:10] hey ottomata yes but I'm heads down on a couple things
[19:37:17] k no worries keep going
[19:37:25] ok, I can talk later or tomorrow morning
[19:37:28] i was heads down and now i'm heads upisthisagoodidea
[19:37:35] i'll be here for an hour or two more
[19:37:40] if you go heads up mode
[19:37:42] otherwise no worries!
[19:37:47] k, I'll ping if so
[19:47:58] fdans: yes, let's have him integrate once we have done our vetting
[19:53:50] hey a-team: when does Wikistats data update for the most recent month? Is there some common deadline you're working with (e.g. within one week after the turn of the month).
[19:54:35] Nettrom: usually only a couple of days, this month there was an extra delay due to an error
[19:55:19] milimetric: ah, sorry to hear that! and thanks :)
[19:55:58] Analytics, Analytics-EventLogging, Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decomisioning - https://phabricator.wikimedia.org/T231858 (Nuria) >archive both db1107 and db1108 on hadoop Sounds good, this last step is about doing a one time backup right? We also...
[19:57:38] Nettrom: ofc, and even this month it was done computing on October 5th, and we're just taking a closer look before it's published
[19:58:03] btw fdans I should deploy the new snapshot, no?
[19:58:16] oops, sorry, forgot timezones exist
[19:58:28] ottomata: ok, I'm heads up now :)
[19:58:32] cave?
[19:59:02] milimetric: ya 2 mins
[20:00:08] (CR) Nuria: "Before we get this job running let's vet the data we have to make sure we do not have bugs on our entropy code by plotting data, looking " [analytics/refinery/source] - https://gerrit.wikimedia.org/r/541557 (https://phabricator.wikimedia.org/T215863) (owner: Mforns)
[20:13:23] ottomata: wanted to double check something before i kick off all these per-wiki coordinators. Should i do anything special about concurrency? It might not matter, but oozie might also try and schedule all 20 tasks at the exact same moment instead of running one or a couple at a time
[20:13:57] this is where a bundle has a few jobs that are performed across all wikis, and then a coordinator per-wiki for ~20 wikis
[20:15:06] i wonder in part because i implemented things like auto-limiting resource usage, so for example hyperparam will auto-size memory as necessary to load datasets, and then limit executor counts to use <1TB of aggregate memory. But if 10 jobs run at once they will all limit to 1TB and it probably doesn't help much then
[20:19:07] (CR) Nuria: Add spark job to generate a data quality report (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/541557 (https://phabricator.wikimedia.org/T215863) (owner: Mforns)
[20:41:01] Analytics, Analytics-Kanban, Research: Add anomaly detection alarm to detect traffic variations on countries overall - https://phabricator.wikimedia.org/T234484 (diego)
[20:45:21] CindyCicaleseWMF: can you give me an example of an application that now authenticates with wikimedia oauth?
[20:46:12] nuria: Phabricator
[20:46:20] CindyCicaleseWMF: ah right
[20:46:50] CindyCicaleseWMF: wanted to check one thing before i answered
[20:47:40] nuria: But, I was surprised to see that there are almost 950 clients registered at https://meta.wikimedia.org/w/index.php?title=Special:OAuthListConsumers&limit=500. I wonder how many of them are no longer being used.
[20:56:27] hmm ebernhardson
[20:56:59] heh, i guess it depends on how heavy the jobs are :p
[20:57:01] but hm
[20:57:28] Analytics, Analytics-Kanban, Research: Add anomaly detection alarm to detect traffic variations on countries overall - https://phabricator.wikimedia.org/T234484 (diego) Hi, We (research) will be supporting @ssingh on his work related to this problem, especially focused in censorship. @Nuria & @m...
[20:58:48] ottomata: well, they all limit themselves to 1TB of memory or 450 cores, whichever is lower :P
[20:58:57] err, whichever limit hits first
[20:59:07] aye, but probably running 20 at once
[20:59:16] ebernhardson: afaik there isn't a way to set bundle concurrency
[20:59:27] only coordinators, which is for timed dataset
[20:59:30] only a few will actually run for a long time, they all do the same number of iterations but enwiki takes 30-45min per iteration and hewiki takes 3 min
[20:59:58] hmm, yea. I suppose i can at least put them in nice queue and see what it does
[21:00:34] might help.
[21:00:45] ebernhardson: how often to the coords get schudled?
[21:00:47] scheduled?
[21:00:49] ottomata: weekly
[21:00:53] ah ok
[21:00:55] although it could be monthly, tbh
[21:01:00] so throttle / concurrency likely to matter
[21:01:07] but weekly will probably be better to keep things from being broken regularly, or at least noticed
[21:01:07] unlikely*
[21:01:08] ha
[21:02:17] the most expensive step might also be something we don't have to keep re-running, will have to wait and see. In an ideal world our source data changes slowly and the output of the most expensive part can be reused. will wait and see
[21:02:22] ebernhardson: perhaps just setting spark.dynamicAllocation.maxExecutors would be enough?
[21:02:40] Analytics, Analytics-Kanban, Research: Add anomaly detection alarm to detect traffic variations on countries overall - https://phabricator.wikimedia.org/T234484 (Nuria) @Diego @ssingh has done work in the past on this regard . I think the strategy of looking just at countries where we "suspect" dr...
[21:03:02] i think we have that defaulting to 128 anyway?
[21:03:19] ah nope, no default value for that one
[21:03:36] ottomata: 128 would be way too high on this :) That gets set automagically in this step of my app. But it's set so that we have aggregate memory usage of 1TB, or 450 cores, whichever limit it hits first
[21:03:44] i could lower that of course
[21:04:10] ebernhardson: not sure, worth a try i guess. FYI the cluster only has 4.29 TB mem total
[21:04:21] and 1866 cores
[21:04:31] ottomata: right, that's why i chose 1tb. if 1 is running at a time 1tb /450 cores seemed fair
[21:04:38] but then with more...it's complex :P
[21:04:41] right
[21:05:19] maybe i can convince oozie to spread the wikis out over the week, instead of running them all at 00:00 on the same day
[21:12:44] Analytics, Analytics-Kanban, Research: Add anomaly detection alarm to detect traffic variations on countries overall - https://phabricator.wikimedia.org/T234484 (diego) @Nuria the entropy approach looks very cool, thanks for sharing. The approach of having suspicious countries sounds dangerous to...
[21:41:57] the answer is: not easily. And it doesn't offer any guarantees, delays in the input data could still pile up jobs. hmm. will ponder :)
[21:54:10] ottomata: alternate solution, What if i add a new queue to yarn with maxRunningApps=1 (or maybe 2), and then put these specific jobs in it?
[22:14:25] Analytics, Analytics-Kanban, Research: Add anomaly detection alarm to detect traffic variations on countries overall - https://phabricator.wikimedia.org/T234484 (ssingh) Hi @Nuria and @diego. Thank you for your comments. I think I should give some context on my discussions with Diego. From the cen...
[23:01:35] Analytics, Analytics-Kanban, Research: Add anomaly detection alarm to detect traffic variations on countries overall - https://phabricator.wikimedia.org/T234484 (Nuria) I am going to set up a meeting to coordinate efforts. Defining an anomaly is not easy but we can work with a more robust measure t...
[23:42:31] (CR) Nuria: "If we have tested these let's merge and make sure to put in train etherpad these jobs have to be started. Completely agree we have to thin" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717) (owner: Fdans)
[23:42:46] (CR) Nuria: [C: +1] Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717) (owner: Fdans)
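Editor's note on the 20:13-21:54 concurrency discussion: the arithmetic behind "1TB or 450 cores, whichever limit hits first" can be worked through as below. This is only a sketch under stated assumptions, not ebernhardson's actual sizing code: the function names, the 8 GB / 4 core executor size, and treating 1 TB as 1024 GB are all hypothetical. The caps (1 TB, 450 cores) and cluster totals (4.29 TB, 1866 cores) are the figures quoted in the chat.

```python
def max_executors(
    mem_per_executor_gb: int,
    cores_per_executor: int,
    mem_cap_gb: int = 1024,  # assumed reading of the "1TB" cap from the chat
    core_cap: int = 450,     # core cap from the chat
) -> int:
    """Executor count allowed by whichever cap is reached first."""
    by_memory = mem_cap_gb // mem_per_executor_gb
    by_cores = core_cap // cores_per_executor
    return max(1, min(by_memory, by_cores))


def jobs_at_full_cap(
    cluster_mem_tb: float,
    cluster_cores: int,
    mem_cap_tb: float = 1.0,
    core_cap: int = 450,
) -> int:
    """How many jobs can each use their full cap on the cluster at once."""
    return int(min(cluster_mem_tb // mem_cap_tb, cluster_cores // core_cap))


if __name__ == "__main__":
    # Hypothetical executor size: 8 GB and 4 cores each.
    print(max_executors(8, 4))           # cores are the binding limit here
    print(jobs_at_full_cap(4.29, 1866))  # cluster totals quoted in the chat
```

With these figures only about four jobs fit at their full cap, which is why roughly 20 per-wiki coordinators firing at the same moment would contend for resources. A per-job value like the first one could be supplied as `spark.dynamicAllocation.maxExecutors`, the setting ottomata suggests, while the queue idea floated at 21:54 limits how many applications run at all.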