[01:14:19] PROBLEM - Check the last execution of reportupdater-browser on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:24:57] RECOVERY - Check the last execution of reportupdater-browser on stat1007 is OK: OK: Status of the systemd unit reportupdater-browser https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:01:34] (CR) Elukey: [C: +2] Move codebase to python3 [analytics/reportupdater] - https://gerrit.wikimedia.org/r/537268 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[06:02:25] !log set python3 for all report updater jobs on stat1006/7
[06:02:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:05:54] all right report updater now set for python3
[06:07:10] Analytics, Analytics-Kanban, Patch-For-Review: Move Analytics Report Updater to Python 3 - https://phabricator.wikimedia.org/T204736 (elukey)
[06:15:08] and now, setting stat1005 to be used by analytics users :)
[06:15:20] it has been a long time
[07:19:03] Analytics: Check home leftovers of smalyshev - https://phabricator.wikimedia.org/T231861 (elukey) @EBernhardson ping :)
[07:21:05] aaand stat1005 ready :)
[07:21:11] I added some docs here and there
[07:21:22] https://wikitech.wikimedia.org/wiki/Stat1005
[07:21:50] https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Host_access_granted
[07:22:02] https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients
[08:15:56] Analytics, Analytics-Kanban: Move Analytics Report Updater to Python 3 - https://phabricator.wikimedia.org/T204736 (elukey) Marcel verified that the code runs correctly with python3, and today I deployed it and updated the systemd timers on stat1006/7. Everything seems working as expected!
[08:37:50] !log add netflow realtime ingestion alert for Druid
[08:37:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:38:59] Analytics, Analytics-Kanban, Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (elukey) Added realtime ingestion specs to Refinery and added the alarm for realtime data, task is completed!
[08:39:09] Analytics, Analytics-Kanban, Patch-For-Review: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (elukey)
[09:08:02] "attributes" is one of those words that you spend a whole morning looking at and it starts looking super weird
[09:11:01] good morning :)
[09:12:54] helloo luca
[09:22:00] Analytics, Better Use Of Data, Product-Infrastructure-Team-Backlog, Epic: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (phuedx) @fgiunchedi: Sorry for the belated ping but has the > Write minimal client to send errors without attempting normalization f...
[10:10:36] joal: do you remember a long time ago that we agreed that we would get rid of the intervals in pageviews per country?
[10:10:38] https://wikimedia.org/api/rest_v1/metrics/pageviews/top-by-country/all-projects/all-access/2019/08
[10:10:46] I just realized they are still there
[10:25:21] elukey, what was the other deletion timer that is using a script other than refinery-drop-older-than? I can only find one in data_purge.pp, but I recall there was another one in another file...?
[10:27:02] maybe modules/profile/manifests/analytics/refinery/job/test/data_purge.pp?
[10:27:05] it was using refinery-drop-hive-partitions
[10:27:22] or modules/profile/manifests/analytics/search/jobs.pp?
[10:27:26] yes exactly
[10:27:35] and this one?-> modules/profile/templates/analytics/refinery/job/refinery-drop-mediawiki-xmldumps-pages_meta_history.sh.erb?
[10:27:36] in data_purge refinery-drop-query-clicks is absented
[10:27:43] yes
[10:27:44] going to remove it from puppet
[10:27:49] ah ok
[10:28:10] elukey, do you want me to remove it?
[10:28:17] as part of the change?
[10:28:28] nono doing it now
[10:29:26] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/539084/
[10:30:59] refinery-drop-mediawiki-xmldumps-pages_meta_history.sh.erb should be using drop-older-than right?
[10:31:23] yes yes
[10:31:53] so, there's only 2 files: search/jobs.pp and test/data_purge.pp
[10:32:01] can I change both?
[10:32:27] of course!
[10:32:30] cool
[10:32:39] the test data_purge is my fault, it is the testing stuff :(
[10:33:29] elukey, does it make sense to copy-paste the commands from data_purge.pp over to test/data_purge.pp for drop-webrequest-raw/refined-partitions?
[10:39:55] yep!
[10:40:03] ah no sorry
[10:40:06] the datasource is different
[10:40:13] it is called webrequest-test or something
[10:40:22] that changes the args :(
[10:48:40] * elukey lunch!
[12:25:43] Hi team - I kinda expected it but didn't want to jinx it - Lino is now sick as well, and so am I (not stomach, but head) - Doing my best to keep things rolling, but it'll be sloooooowwwwww
[12:35:31] joal: :(
[12:50:52] Analytics: Superset + Turnilo access for Verena Lindner + Raja Gumienny (WMDE) - https://phabricator.wikimedia.org/T231677 (Verena) @Nuria not yet (was on vacation). I just created the request https://phabricator.wikimedia.org/T233807 Should I add any subscribers to the ticket?
[13:17:49] first lovely discovery
[13:18:20] zookeeper on buster gets shipped with jars that are compiled for a version of java that is not compatible with 8
[13:18:30] most likely 11, but probably retro-compatible up to 9
[13:18:46] ending up in issues like https://github.com/plasma-umass/doppio/issues/497
[13:24:28] ohboy
[13:24:45] morning :)
[13:25:32] morning!
[13:25:38] i'm at the product infrastructure offsite today!
[13:25:45] so will miss meetings (except for the MEP eng sync)
[13:26:04] ack!
[13:26:11] who is the commander in chief?
[13:26:19] probably Dan :D
[13:37:26] Analytics-EventLogging, Analytics-Kanban, EventBus, CPT Initiatives (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (Ottomata)
[13:37:41] !log move the Hadoop test cluster to the Analytics Zookeeper cluster
[13:37:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:49:49] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (Gilles)
[13:50:43] Analytics, Discovery-Search, Multimedia, Reading-Admin, and 3 others: Image Classification Working Group - https://phabricator.wikimedia.org/T215413 (Gilles)
[13:54:30] elukey, I have a problem with the refinery-drop-older-than script, when we designed it, we added this protection, the undeletable_paths, that right now prevents anything under /wmf/data/archive from being deleted. And the mediawiki history dumps are there.
[13:55:14] elukey, I think we can remove that layer of protection, it is very difficult to delete something unwillingly with that script
[13:55:44] or else, we could add an --ignore-undeletable flag, that would skip the check
[13:55:51] but I really think this would be too much...
[13:55:52] mforns: I like the latter
[13:56:02] hmm
[13:56:02] even if...
mmmm
[13:56:12] no no it doesn't add anything
[13:56:28] if we allow one script to delete in there it can potentially delete all
[13:56:36] so no point in adding the flag, okok
[13:56:59] so given the fact that we are sure about the script (battle tested up to now), I am +1 on the idea
[13:57:12] but I think that Nuria or somebody else should +1 as well
[13:57:15] ok, let me think if there are more options..
[13:57:16] just to be on the same page
[13:57:19] sure
[13:58:18] one possible danger is that the regexp that specifies the path format does by mistake accept paths that are not to be deleted...
[13:59:09] but if specified correctly, the base-path (which is not a regexp) should restrict deletions to a given directory tree
[14:00:36] hmmm, maybe it's simpler to add the --ignore-undeletable flag
[14:00:50] orrrr
[14:02:04] we remove /wmf/data/archive from undeletable_paths and add all other /wmf/data/archive/*
[14:02:18] very ad-hoc but easy and safe
[14:02:39] * (except for mediawiki)
[14:03:19] or we can have a whitelist of allowed paths for /wmf/data/archive
[14:03:23] if not in it, bail out
[14:07:00] aha
[14:09:15] elukey: looks like everything in smalyshev's can be purged
[14:10:24] ebernhardson: ack! can you add it to the task just to keep archives happy?
[14:11:02] ottomata: so unfortunately...none of my oozie jobs have progressed this week. Looks like backfill started sept-24 14:00, which is probably when you added new partitions, but no earlier partitions have shown up
[14:13:21] i couldn't find anything in grafana related to camus, so not sure how to see how it's progressing, but seems it should have caught up half a day or so by now
[14:18:14] we don't have metrics for camus, it runs via systemd timer periodically so it is difficult to have prometheus metrics (maybe graphite but not sure if those are supported)
[14:18:40] but we have checks in place
[14:19:00] what data is missing??
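Editor's sketch of the allowlist idea floated at 14:03 ("a whitelist of allowed paths for /wmf/data/archive, if not in it, bail out"). This is only an illustration under stated assumptions, not the refinery-drop-older-than implementation: the names `ARCHIVE_ROOT`, `ALLOWED_ARCHIVE_SUBTREES`, `is_within` and `may_delete` are hypothetical, and the allow-listed subtree shown is a placeholder, since the log does not give the real location of the mediawiki history dumps.

```python
import posixpath

# Hypothetical values for illustration only; the real undeletable_paths list
# and the archive subtrees that should be deletable are not shown in this log.
ARCHIVE_ROOT = "/wmf/data/archive"
ALLOWED_ARCHIVE_SUBTREES = ["/wmf/data/archive/mediawiki"]


def is_within(path, root):
    """True if path equals root or lies below it, after normalizing '..' and '//'."""
    path = posixpath.normpath(path)
    root = posixpath.normpath(root)
    return path == root or path.startswith(root + "/")


def may_delete(path):
    """Inside ARCHIVE_ROOT only allow-listed subtrees pass; anything else bails out."""
    if not is_within(path, ARCHIVE_ROOT):
        # Outside the archive tree this particular check does not apply;
        # any other protections would still have to be checked separately.
        return True
    return any(is_within(path, allowed) for allowed in ALLOWED_ARCHIVE_SUBTREES)
```

Normalizing before comparing means a path such as `/wmf/data/archive/mediawiki/../pageview` is judged by where it really points, and comparing against `root + "/"` keeps a sibling directory that merely shares a name prefix from slipping through.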
[14:19:44] elukey: event.mediawiki_cirrussearch_request, from which we derive search click through data. It stopped 2019-09-21 00:00:00, and otto added more partitions yesterday
[14:20:08] apparently it backed up at camus between kafka and hdfs
[14:21:49] yeah we had issues over the weekend
[14:25:07] interesting
[14:25:08] elukey@an-coord1001:~$ ls /mnt/hdfs/wmf/data/raw/event/eqiad_mediawiki_cirrussearch-request/hourly/2019/09/
[14:25:18] (removing tabs sigh)
[14:25:34] elukey@an-coord1001:~$ ls /mnt/hdfs/wmf/data/raw/event/eqiad_mediawiki_cirrussearch-request/hourly/2019/09/
[14:25:38] 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 24 25
[14:25:44] so 21 22 23 are missing
[14:30:47] elukey: right. otto thought after he added partitions the remaining partition would start catching up, but it hasn't even emitted 1 hour at the beginning yet
[14:31:43] ebernhardson: it is importing data now, there is a new camus job for high volume topic that Andrew created
[14:32:06] I can see data in /mnt/hdfs/wmf/data/raw/event/eqiad_mediawiki_cirrussearch-request/hourly/2019/09/25/14
[14:32:42] elukey: right, after otto added new partitions those new partitions are importing
[14:32:56] there is a screen session on an-coord1001, so I am not sure if he is trying to backfill
[14:34:13] ok to restart Apache on the various an/analytics-tool hosts? (for Expat sec update)
[14:36:58] moritzm: yep!
[14:37:36] k
[14:39:02] Analytics-EventLogging, Analytics-Kanban, EventBus, CPT Initiatives (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (Ottomata)
[14:39:10] done
[14:40:47] thanks :)
[14:41:13] ebernhardson: we'll have to wait for Andrew to check IRC (he is at the product infra offsite today)
[14:41:27] I can't see what's happening on his screen session
[14:55:41] joal: if the malaises have not completely taken you out would you have a couple minutes in the bc?
I'd like to ask you a quick question
[15:03:11] elukey, can you help me? :] I'm trying to execute a test with refinery-drop-older-than in an-coord1001.eqiad.wmnet and getting some issuessss
[15:03:27] mforns: of course
[15:03:34] 5 euros
[15:03:35] :D
[15:03:42] xD
[15:03:53] ok, done
[15:04:23] if I execute python3 in stat1007 it doesn't have mock lib installed
[15:04:40] so I tried an-coord1001, which is the machine that actually executes the script in prod
[15:04:48] and it does have mock!
[15:04:50] ah yes I found the same issue on an-coord1001, and installed manually the lib
[15:05:02] I have to add it to the shared libs in puppet to deploy
[15:05:03] but... my user (mforns) does not have access to hive
[15:05:26] java.io.FileNotFoundException: /etc/hive/conf.analytics-hadoop/hive-site.xml (Permission denied)
[15:05:32] stat1007 fixed
[15:05:36] if I...
[15:05:41] xD, that easy?
[15:05:41] yes basically that file contains passwords on an-coord1001
[15:05:54] ok cool!, but one question
[15:06:11] if I executed the script under sudo -u analytics in an-coord1001, it should work no?
[15:06:45] but python3 under analytics user does not know the refinery module, even if pythonpath is correct...
[15:07:07] how did you run it?
[15:07:08] and I get: ImportError: No module named refinery
[15:07:29] sudo -u analytics bin/refinery-drop-older-than --database='discovery' ...
[15:07:37] ah wait, you exported in your session, then sudo -u ?
[15:07:45] sudo -u analytics echo $PYTHONPATH seems correct
[15:08:00] maybe..
[15:08:13] no, I didn't export in my session
[15:08:20] mforns: can you try sudo -E -u ?
[15:08:24] ok
[15:08:34] same
[15:08:41] no wait, how did you set PYTHONPATH?
[15:09:53] I didn't!
[15:09:57] so
[15:09:59] elukey@an-coord1001:~$ sudo -u analytics PYTHONPATH=:/srv/deployment/analytics/refinery/python python3 works
[15:10:00] it was already set when I ssh
[15:10:08] yes okok
[15:13:19] mforns: to quickly solve your issue, sudo -u analytics PYTHONPATH=etc..
command
[15:13:28] ok
[15:14:03] TIL https://stackoverflow.com/questions/35824788/sudo-e-does-not-pass-pythonpath
[15:14:15] so this seems to be the reason why sudo -E doesn't work
[15:14:34] I am used to explicitly add PYTHONPATH when sudoing, never encountered this
[15:14:45] ah ok ok
[15:14:49] thanks elukey :]
[15:14:51] but in general, when you sudo -u you don't preserve your environment
[15:15:02] with sudo -E, you force sudo to pick the current env
[15:15:07] unless it is blacklisted :D
[15:15:31] going to add python3-mock to the shared packages
[15:18:24] ebernhardson: on a separate note - ROCm on stat1005 seems working fine now with tensorflow
[15:18:36] I added some docs to https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU
[15:18:42] have you had the chance to test it?
[15:22:05] elukey: not recently, i'll have to see if I have something to test
[15:23:24] ebernhardson: no rush, we are planning to release stat1005 to all analytics privatedata users soon
[15:23:37] I was just curious if there was anything not working etc..
[15:23:55] and also, if we need to buy more :)
[15:24:06] Gone for kids doctor appointment - Will miss standup - sorry team
[15:26:50] (PS7) Fdans: Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[15:43:40] Analytics, Better Use Of Data, Product-Infrastructure-Team-Backlog, Epic: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (fgiunchedi) >>! In T226986#5521772, @phuedx wrote: > @fgiunchedi: Sorry for the belated ping but has the > >> Write minimal client t...
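Editor's sketch of the workaround discussed between 15:09 and 15:15: rather than relying on `sudo -E` to carry `PYTHONPATH` across, put the variable on the sudo command line as `VAR=value`. The helper name `sudo_with_env` is hypothetical, the path is the one quoted in the log (without its leading colon), and whether sudo accepts command-line variables at all depends on the local sudoers policy, which is assumed here to permit it.

```python
import shlex


def sudo_with_env(user, env, command):
    """Build a sudo argv that passes variables explicitly as VAR=value arguments."""
    assignments = [f"{name}={value}" for name, value in env.items()]
    return ["sudo", "-u", user] + assignments + list(command)


argv = sudo_with_env(
    "analytics",
    {"PYTHONPATH": "/srv/deployment/analytics/refinery/python"},
    ["python3", "-c", "import refinery"],
)

# Print a copy-pasteable shell line; nothing is executed here.
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv)` on a host where the policy allows it; building it as a list avoids shell quoting problems when a value contains spaces.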
[15:46:51] (PS1) Mforns: Improve refinery-drop-older-than after python3 migration [analytics/refinery] - https://gerrit.wikimedia.org/r/539146 (https://phabricator.wikimedia.org/T204735)
[15:47:03] Analytics, Operations, Traffic, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (herron) p:Triage→Normal
[15:48:04] (PS2) Mforns: Improve refinery-drop-older-than after python3 migration [analytics/refinery] - https://gerrit.wikimedia.org/r/539146 (https://phabricator.wikimedia.org/T204735)
[15:49:02] (CR) Mforns: "This has been tested :]" [analytics/refinery] - https://gerrit.wikimedia.org/r/539146 (https://phabricator.wikimedia.org/T204735) (owner: Mforns)
[15:54:10] (CR) Elukey: Improve refinery-drop-older-than after python3 migration (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/539146 (https://phabricator.wikimedia.org/T204735) (owner: Mforns)
[16:32:16] bad news team :( Naé has a germ (https://en.wikipedia.org/wiki/Campylobacter) and will go to hospital tonight (normally not for long) - Lino will need to stay home until we are sure he doesn't have the thing (highly contagious)
[16:32:34] I'll need to be off for real tomorrow onward, hopefully back on Monday
[16:34:29] joal: ack, take care of her :(
[16:40:35] (CR) Mforns: Improve refinery-drop-older-than after python3 migration (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/539146 (https://phabricator.wikimedia.org/T204735) (owner: Mforns)
[16:43:38] joal, don't worry, lots of energy!
[16:56:38] * elukey off!
[17:11:18] (PS8) Fdans: Add oozie job to load top mediarequests data [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717)
[18:23:55] Analytics, Desktop improvements, EventBus, Readers-Web-Backlog (Readers-Web-Kanbanana-2019-20-Q1): [SPIKE 8hrs] How will the changes to eventlogging affect desktop improvements - https://phabricator.wikimedia.org/T233824 (Ottomata)
[18:28:36] Analytics, Desktop improvements, EventBus, Readers-Web-Backlog (Readers-Web-Kanbanana-2019-20-Q1): [SPIKE 8hrs] How will the changes to eventlogging affect desktop improvements - https://phabricator.wikimedia.org/T233824 (Ottomata) See also: [[ https://docs.google.com/document/d/1dpCo33RpZAbQG15n...
[18:50:45] a-team ok gang, the most requested media file from english wikipedia for all the month of August is...
[18:50:47] https://upload.wikimedia.org/wikipedia/commons/5/55/WMA_button2b.png
[18:50:57] exciting!
[18:51:27] :225,422,827 requests
[18:52:24] (CR) Fdans: [V: +1] "Verified also for monthly granularity" [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717) (owner: Fdans)
[18:53:21] \o/ fdans!!!
[18:53:35] it worked! :D
[18:53:51] see, don't lose your hope
[18:53:56] mforns: second try haha
[18:54:11] ah, ok, neverthelesssss
[18:54:15] about 1:35 each job
[18:55:02] super fine
[19:34:43] Analytics, Analytics-Kanban, Patch-For-Review: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718 (EBernhardson) Doesn't look like this is catching up. New data is arriving again from the new partitions, but the previous data does not appear to...
[19:53:45] Analytics, Fundraising-Backlog, Operations, SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (herron) >>!
In T233636#5517703, @EYener wrote: > Tools / Data Sources > Turnilo > Superset Afaict LDAP...
[19:53:59] Analytics, Fundraising-Backlog, Operations, SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (herron)
[20:12:26] Analytics, Analytics-Kanban, Patch-For-Review: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718 (Ottomata) Ok, something is wrong with Camus and the eqiad.mediawiki.cirrussearch-request topic partition 0. Camus seems stuck on offset 231655046...
[20:12:39] Analytics, Analytics-Kanban, Patch-For-Review: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718 (Ottomata) p:Triage→High