[05:31:09] Analytics, Analytics-Wikistats: Wikistats New Feature - https://phabricator.wikimedia.org/T258996 (Mrjyn)
[06:38:21] Analytics: Check home/HDFS leftovers of drossi/fsalutari - https://phabricator.wikimedia.org/T258788 (elukey) * drossi ` ====== stat1004 ====== total 0 ls: cannot access '/var/userarchive/drossi.tar.bz2': No such file or directory ====== stat1005 ====== total 0 ls: cannot access '/var/userarchive/drossi.t...
[06:42:13] !log re-run webrequest-load hour 2020-7-28-3
[06:42:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:35:04] Analytics-Radar, Operations, SRE-Access-Requests, Patch-For-Review, and 2 others: Creation of a new POSIX group and system user for the Product Analytics team - https://phabricator.wikimedia.org/T255039 (elukey) Open→Resolved
[07:35:07] Analytics-Radar, Product-Analytics, Release-Engineering-Team, Repository-Admins: Create a repository and user for Product Analytics Oozie jobs - https://phabricator.wikimedia.org/T230743 (elukey)
[07:36:13] Analytics-Radar, Product-Analytics, Release-Engineering-Team, Repository-Admins: Create a repository and user for Product Analytics Oozie jobs - https://phabricator.wikimedia.org/T230743 (elukey) Just completed the creation of the `analytics-product` system user + kerberos keytab, you should now...
[08:03:34] !log Superset migrated to CAS
[08:03:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:03:35] * elukey dances
[08:12:01] Analytics, Analytics-Kanban: Alarm when druid indexation fails - https://phabricator.wikimedia.org/T254493 (elukey) @Ottomata since we are now running the jobs on an-launcher1002, maybe we could try to run them with spark local mode? It should add some overhead but we have a lot of unused ram atm and it...
[08:14:02] Analytics, Product-Analytics: Streamline Superset signup and authentication - https://phabricator.wikimedia.org/T203132 (elukey) Turnilo and Superset now use the CAS SSO portal to authenticate, so `Log at superset.wikimedia.org, remembering to use the UNIX shell username from step 1.` is finally not need...
[09:10:54] !log temporarily stop eventlogging file consumers on eventlog1002 to copy some data over to stat1005 (/srv partition full)
[09:10:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:18:18] on eventlog1002 we have some files from 2018 under archive
[09:18:37] that take a ton of space, ~350G
[09:18:50] not sure if they are needed or not, I am copying them to stat1005 to be sure
[09:48:15] !log re-enable eventlogging file consumers on eventlog1002
[09:48:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:49:08] * elukey errand + lunch
[10:32:46] Analytics-Radar, Operations, Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (MoritzMuehlenhoff) Hue fails to start due to some conflicts between the system Python and the modules bundled by Hue: > Jul 28 10:17:38 an-tool1009 systemd[1]: Failed to start LSB: H...
[11:58:09] Analytics-Radar, Operations, Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (elukey) ` elukey@an-tool1009:~$ /usr/lib/hue/build/env/bin/python2.7 --version Python 2.7.9 elukey@an-tool1009:~$ /usr/lib/hue/build/env/bin/python2.7 Python 2.7.9 (default, Mar 1 20...
[12:12:24] Analytics-EventLogging, Analytics-Radar, QuickSurveys, MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: QuickSurveys EventLogging missing ~10% of interactions - https://phabricator.wikimedia.org/T220627 (Isaac) > it should be possible to test this explanation. We can make QuickSurveys use...
[12:14:51] Analytics-Radar, Product-Analytics, Structured-Data-Backlog: Create a Commons equivalent of the wikidata_entity table in the Data Lake - https://phabricator.wikimedia.org/T258834 (Miriam) Thanks @Morten for opening this task! A few use cases below: * In our work on image recommendation for unillustr...
[12:27:26] Analytics-Radar, Operations, Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (MoritzMuehlenhoff) This commit landed in v2.7.14rc1 (and Stretch has 2.7.13): https://github.com/python/cpython/commit/3e37f4a11547a226c3c2f8bd612510465db397b9
[12:38:01] Analytics, Analytics-Kanban: Alarm when druid indexation fails - https://phabricator.wikimedia.org/T254493 (Ottomata) Sure!
[13:48:32] Analytics: Stop saving eventlogging data on eventlog1002 - https://phabricator.wikimedia.org/T259030 (Ottomata)
[13:50:24] ottomata: o/ did you clean up eventlog1002?
[13:51:01] ah yes you did
[13:51:09] okok makes sense :)
[13:51:31] this morning I moved a big 160G file in my home dir on stat1005, to free 160G of space
[13:51:51] it was from 2018, but I didn't want to start a clean up without asking the team first
[13:52:28] elukey: yeah sorry, was pinged in ops, answered there
[13:52:32] i removed a few files
[13:52:37] from /srv/log/eventlogging
[13:52:42] just filed ^^^
[13:53:54] ottomata: I guess that I can remove the file that I copied this morning from 2018 on stat1005 then
[13:53:57] :)
[13:54:15] I am +1 on removing file-based consumers, seems that we can stop them
[13:56:11] hi all!
[14:10:33] ya
[14:10:36] hello!
[14:16:32] a-team: please try to log in to Superset when you have a moment, it has been migrated to CAS-SSO
[14:19:37] ottomata: disk space on eventlog1002 again raising the alert
[14:20:14] ah wait there are 158G in nuria's home
[14:21:10] that were for https://phabricator.wikimedia.org/T219842, more than a year ago
[14:21:20] nuria: can we drop those?
[14:23:42] wow why is it filling so fast all of a sudden
[14:23:50] the data doesn't seem larger
[14:25:46] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-24h&to=now&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=eventlogging-client-side looks stable
[14:27:37] ottomata: maybe we can try to copy client-side-events.log-20200728 (under archive) to a stat host for the moment
[14:27:46] it is around 300G
[14:28:28] in the meantime, I am stopping the client side consumer
[14:29:38] !log stop client-side-events-log.service on eventlog1002 to avoid /srv filling up
[14:29:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:30:38] started the copy on stat1005
[14:31:18] in theory we could simply remove the consumer
[14:32:55] elukey: can we stop the all events consumer instead?
[14:33:01] it is potentially less useful
[14:33:09] oh but client side is way bigger hm
[14:33:15] yes
[14:33:19] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?panelId=33&fullscreen&orgId=1&from=now-2d&to=now&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=eventlogging-client-side
[14:33:29] it matches when I restarted eventlogging yesterday
[14:33:37] OH
[14:33:41] is it reconsuming for some reason?
[14:33:51] from beginning of topic??
[14:33:51] I think so, really weird
[14:34:30] yes, i see events from 2020-07-20 in archive/client-side-events.log-20200728
[14:34:38] after https://gerrit.wikimedia.org/r/c/operations/puppet/+/616445
[14:35:09] maybe the eventlogging_valid_mixed_schema_whitelist change somehow triggers a change in the consumer?
[14:35:21] just curious why did we remove that from valid mixed filter?
[14:35:33] it shouldn't have
[14:36:12] it should commit offsets, maybe it wasn't committing them?
[14:36:18] would burrow know?
[14:36:19] so an alert landed for a single event not conforming to a schema, and on meta the schema looked deprecated, so I filed a patch to remove it
[14:36:29] ah ok cool
[14:36:30] but I may have done the wrong thing :(
[14:36:46] naw is fine, it wouldn't matter if it was in valid mixed filter or not
[14:36:47] but that's fine
[14:37:04] that won't stop the event from coming in though
[14:37:20] eventlogging-processor will still try to validate it
[14:37:32] ottomata: https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&from=now-24h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=eventlogging_consumer_client_side_events_log_00
[14:38:03] weird, it's like your restart caused it to reset for some reason
[14:38:09] it looks like it was committing offsets fine
[14:38:28] well is it back to normal now?
[14:38:34] can we just leave it on and delete some files?
[14:38:48] probably in standup we'll decide just to turn it off anyway
[14:38:54] i guess you can just leave it off
[14:39:14] yep yep weird though
[14:47:57] Analytics: Stop saving eventlogging data on eventlog1002 - https://phabricator.wikimedia.org/T259030 (elukey) It looks like the client-side consumer restarted from the beginning of the topic after my restart of eventlogging daemons yesterday: https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgI...
[14:58:39] Analytics, User-Elukey: Test if Hue can run with Python3 - https://phabricator.wikimedia.org/T233073 (elukey) I tested today building Hue 4.7.1 on a Buster VM, using python3.7 as target, and Hue seems to start fine (the development server at least). Given T258768 I'd be inclined to see if we can package...
[15:41:22] nuria: hii yt? got a sec for a brain bounce based on joseph's review of my event ingestion patch?
[15:41:36] i'm trying to do some things he wanted (and you would probably want too) and it is making some things harder
[15:41:41] ottomata: in a meeting, we can talk after standup
[15:41:44] k
[16:09:07] Analytics-Clusters, Analytics-Kanban, Patch-For-Review, User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (jcrespo) There are 2 ongoing issues: https://github.com/wikimedia/operations-dns/commit/29ff38c263c9f7b4fd366aeb1999ebc4f2d3d8a5 added...
[16:22:07] Analytics: Stop saving eventlogging data on eventlog1002 - https://phabricator.wikimedia.org/T259030 (Nuria) The only place where these files are used is on beta, to test event emission.
[16:39:28] Analytics-Clusters, Analytics-Kanban, Patch-For-Review, User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (jcrespo) So the first test about the backups seems to indicate that it works, but we will need some tuning about the backup options and...
[16:39:39] Analytics, Product-Analytics: Investigate accessing superset via internal VPN or google oauth - https://phabricator.wikimedia.org/T258962 (Nuria)
[16:43:28] PROBLEM - Check status of defined EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: eventlogging-consumer@client-side-events-log https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging
[16:46:07] this is downtime expired --^
[17:03:19] elukey: what do you mean ^
[17:05:09] Analytics-Radar, Product-Analytics (Kanban): Calculate impact of missing mobile app pageviews to high-level metrics - https://phabricator.wikimedia.org/T257373 (SNowick_WMF) Open→Resolved
[17:05:11] Analytics, Analytics-Kanban, Product-Analytics, Epic: API pageview counts for 'Mobile app' are incorrect since switch to mobile-html - https://phabricator.wikimedia.org/T256508 (SNowick_WMF)
[17:05:28] Analytics-Radar, Product-Analytics (Kanban), Readers-Web-Backlog (Needs Product Owner Decisions), covid-19: Weekly updates on readers - https://phabricator.wikimedia.org/T248426 (kzimmerman) We focused on readers and opted not to monitor page previews & unique devices. See https://superset.wikim...
[17:06:37] Analytics-Radar, Product-Analytics (Kanban), Readers-Web-Backlog (Needs Product Owner Decisions), covid-19: Weekly updates on readers - https://phabricator.wikimedia.org/T248426 (kzimmerman)
[17:06:51] Analytics-Radar, Product-Analytics (Kanban), Readers-Web-Backlog (Needs Product Owner Decisions), covid-19: Weekly updates on readers - https://phabricator.wikimedia.org/T248426 (kzimmerman) Open→Resolved
[17:06:53] Analytics-Radar, Product-Analytics (Kanban), Readers-Web-Backlog (Needs Product Owner Decisions), covid-19: Weekly updates on editors & readers - https://phabricator.wikimedia.org/T247873 (kzimmerman)
[17:07:07] Analytics-Radar, Product-Analytics (Kanban), Readers-Web-Backlog (Needs Product Owner Decisions), covid-19: Weekly updates on editors & readers - https://phabricator.wikimedia.org/T247873 (kzimmerman) Open→Resolved
[17:09:15] mforns: ah yes sorry, it is related to the problem that Andrew and I were talking about, I stopped one EL consumer on eventlog1002 and added icinga downtime (2h), but it expired
[17:09:20] so we got the alart
[17:09:22] *alarm
[17:09:28] ok ok
[17:09:36] I already restored the daemon, and Andrew is going to absent it
[17:13:52] Analytics, Product-Analytics: Investigate accessing superset via internal VPN or google oauth - https://phabricator.wikimedia.org/T258962 (kzimmerman)
[17:15:45] Analytics-Radar, Product-Analytics, Structured-Data-Backlog: Create a Commons equivalent of the wikidata_entity table in the Data Lake - https://phabricator.wikimedia.org/T258834 (LGoto) p:Triage→Medium a:nettrom_WMF
[17:19:01] Analytics-Clusters, Analytics-Kanban, Patch-For-Review, User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (elukey) Fixed the DNS issue with https://gerrit.wikimedia.org/r/c/operations/dns/+/616864, thanks a lot for spotting it! My bad :( Ab...
[17:21:09] (PS3) Milimetric: Remove outdated IOS pageview code [analytics/refinery/source] - https://gerrit.wikimedia.org/r/616591 (https://phabricator.wikimedia.org/T257860) (owner: Nuria)
[17:21:16] (CR) Milimetric: [C: +2] Remove outdated IOS pageview code [analytics/refinery/source] - https://gerrit.wikimedia.org/r/616591 (https://phabricator.wikimedia.org/T257860) (owner: Nuria)
[17:21:40] Analytics, Product-Analytics, Research: Annotate pageview data to alert users that previously included mobile app pageview data is NOT included in refined pageview datasets - https://phabricator.wikimedia.org/T258535 (SNowick_WMF) a:SNowick_WMF
[17:21:52] Analytics, Product-Analytics, Research: Annotate pageview data to alert users that previously included mobile app pageview data is NOT included in refined pageview datasets - https://phabricator.wikimedia.org/T258535 (LGoto) p:Triage→Medium
[17:26:41] (CR) Milimetric: "Looks good. Next steps would be for @Tsevener and @Dbrant to +1 or if they'd like to test this out, I can pair with them and show them ho" (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/616629 (https://phabricator.wikimedia.org/T257860) (owner: Nuria)
[17:27:26] (Merged) jenkins-bot: Remove outdated IOS pageview code [analytics/refinery/source] - https://gerrit.wikimedia.org/r/616591 (https://phabricator.wikimedia.org/T257860) (owner: Nuria)
[17:35:50] elukey: labs-private needs fake analytics-product keytabs on all stat boxes, right?
[17:36:01] getting a PCC fail on stat1007 right now because it is missing
[17:36:09] ottomata: it does yes, I haven't added it yet, lemme do it
[17:36:13] ok thank you!
[17:37:00] RECOVERY - Check status of defined EventLogging jobs on eventlog1002 is OK: OK: All defined EventLogging jobs are runnning. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging
[17:38:24] (CR) Dbrant: [C: +1] "The logic looks good to me, but I would be interested in seeing how to test this out. And just to be absolutely certain, the name of the " [analytics/refinery/source] - https://gerrit.wikimedia.org/r/616629 (https://phabricator.wikimedia.org/T257860) (owner: Nuria)
[17:39:20] ottomata: should be ok now!
[17:44:09] great stuff elukey
[17:44:09] https://puppet-compiler.wmflabs.org/compiler1003/24193/
[17:44:36] elukey: ok if I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/616871
[17:44:36] ?
[17:45:17] beta is including the ::files profile class directly
[17:45:20] so it will not be affected
[17:45:49] ahhh I was about to ask, super, +1
[17:45:54] looks super good
[17:46:12] i'll make sure the relevant configs are removed
[17:46:16] <3
[17:46:17] from the hosts
[17:51:58] elukey: -2 cronjobs :)
[17:52:38] !log stopped writing eventlogging data log files on eventlog1002 and stopped syncing them to stat100[67] - T259030
[17:52:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:52:43] T259030: Stop saving eventlogging data on eventlog1002 - https://phabricator.wikimedia.org/T259030
[17:55:23] woooww
[18:02:14] * elukey off!
[18:25:47] Analytics, Product-Analytics: Streamline Superset signup and authentication - https://phabricator.wikimedia.org/T203132 (kzimmerman)
[18:53:40] Analytics-Radar, Datasets-General-or-Unknown, Product-Analytics, Structured-Data-Backlog: Set up generation of JSON dumps for Wikimedia Commons - https://phabricator.wikimedia.org/T259067 (nettrom_WMF)
[18:56:21] nuria: i am on my way to making a new java repo/project
[18:56:23] :)
[19:07:21] ottomata: OK!
[20:10:23] Analytics, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (RobH)
[20:10:36] Analytics, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (RobH)
[20:13:36] Analytics, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (RobH) I suspect that @Ottomata is the person to notify when these are ready for handoff, so I've added him as a subscriber. @Ottomata: if this should be som...
[20:19:25] Analytics, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (Ottomata) Let's add @elukey too!
[20:25:13] nuria: i've pushed a bit more of what i'm working on to github, but i'm still having some of the same problems
[20:25:22] it isn't all there yet, just some basics
[20:25:27] so maybe it will be easier to grok more quickly
[20:25:31] ottomata: k
[20:25:35] https://github.com/ottomata/wikimedia-eventutilities
[20:25:45] basically, this stuff
[20:25:45] https://github.com/ottomata/wikimedia-eventutilities/tree/master/eventutilities-wikimedia/src/main/java/org/wikimedia/eventutilities/wikimedia/event
[20:25:54] it doesn't quite feel right
[20:26:08] really, this is ALL wikimedia specific
[20:26:17] I can't figure out how to make a useful non wikimedia specific thing
[20:26:27] Schema loaders, easy and fine
[20:26:30] but stream config is so specific
[20:27:50] maybe that is ok? maybe since i'm calling this library wikimedia-eventutilities
[20:27:59] i can just mash it together?
[20:28:14] e.g. i could put the things I need in just EventStreamConfig + EventStream
[20:28:27] but not have any special wikimedia.event package and Wikimedia* classes?
[20:29:02] ottomata: i think that sounds right, the streamconfig is very wikimedia
[20:29:31] ok if you think that's ok, i'll try that.
[21:21:41] Analytics: Stop saving eventlogging data on eventlog1002 - https://phabricator.wikimedia.org/T259030 (Nuria) Deleted big files from homedir
[22:29:55] Analytics-Clusters, Patch-For-Review, User-Elukey: Upgrade Druid to its latest upstream version (currently 0.18.1) - https://phabricator.wikimedia.org/T244482 (mforns) Uou, after a couple trials, managed to test it properly, and it looks like it's working! This ingestion spec re-compacts already indexed h...
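
Editor's note on the 14:33-14:38 exchange about the client-side consumer re-reading its topic. The log does not establish a root cause: the lag dashboard suggested offsets were being committed, yet the consumer restarted from the beginning after the daemon restart. As background only, the sketch below models the general Kafka rule that a consumer group resumes from its committed offset when one is usable and otherwise falls back to the `auto.offset.reset` policy. It is a plain-Python toy, not the EventLogging consumer code; the function names and all numbers are hypothetical.

```python
from typing import Optional


def resolve_start_offset(
    committed: Optional[int],
    log_start: int,
    log_end: int,
    reset_policy: str = "earliest",
) -> int:
    """Pick the offset a consumer group resumes from for one partition.

    Hypothetical helper: it mirrors the general Kafka behaviour, not any
    specific client library or the EventLogging implementation.
    """
    # A committed offset that is still inside the retained log wins.
    if committed is not None and log_start <= committed <= log_end:
        return committed
    # No usable committed offset: fall back to the reset policy
    # (the role played by Kafka's auto.offset.reset setting).
    if reset_policy == "earliest":
        return log_start
    if reset_policy == "latest":
        return log_end
    raise ValueError(f"unknown reset policy: {reset_policy!r}")


def backlog(start: int, log_end: int) -> int:
    """Number of messages the consumer reads before it has caught up."""
    return log_end - start


# Made-up partition: retained offsets 100..1000.
normal = resolve_start_offset(committed=990, log_start=100, log_end=1000)
lost = resolve_start_offset(committed=None, log_start=100, log_end=1000)
print(backlog(normal, 1000), backlog(lost, 1000))
```

With an "earliest" policy, a group whose committed offset is missing or unusable re-reads everything still retained in the topic, which is one way a file-writing consumer can suddenly produce a very large output file. Whether that is what happened on eventlog1002 is not confirmed anywhere in this log.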