[00:50:26] <icinga-wm>	 PROBLEM - Check the last execution of monitor_refine_eventlogging_analytics_failure_flags on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:05:57] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10Jseddon)
[03:35:02] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10Jseddon) I've just posted some context for this here: https://lists.wikimedia.org/pipermail/wikimedia-l/2020-June/095008.html  More information and documentation will be coming s...
[05:06:56] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10ArielGlenn) >>! In T254275#6222378, @GereonK wrote: > WMF is actually planning to charge for acess? So what percentage of this do the creators of the content get? Otherwise it's...
[05:16:47] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Decomission notebook hosts - https://phabricator.wikimedia.org/T249752 (10elukey) No late access to the hosts, the user migration step seems to have worked!
[05:17:29] <fdans>	 elukey: o/
[05:17:32] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Decomission notebook hosts - https://phabricator.wikimedia.org/T249752 (10elukey)
[05:17:56] <elukey>	 fdans: hola!
[05:18:25] <fdans>	 elukey: are you SUMPER PUMPED for our interview today?
[05:18:31] <fdans>	 I could barely sleep in anticipation
[05:20:44] <elukey>	 ahahhahah
[05:21:13] <elukey>	 I think it will be interesting
[05:29:21] <wikibugs>	 10Analytics-Radar, 10Performance-Team: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 (10elukey) @Gilles as FYI it re-happened for hour 2019-06-14T08, I forced the refine for NavigationTiming to skip the problematic records :)
[05:32:16] <icinga-wm>	 RECOVERY - Check the last execution of monitor_refine_eventlogging_analytics_failure_flags on an-launcher1001 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:35:27] <wikibugs>	 10Analytics-Radar, 10Performance-Team: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 (10Gilles) @elukey do you mean that requires manual work for you every time this happens?
[05:44:26] <wikibugs>	 10Analytics-Radar, 10Performance-Team: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 (10elukey) @Gilles refine stops when these kind of events happen to be on the safe side, it is just a line of bash to make it re-run skipping the problematic fields (we could automate it but it h...
[05:48:36] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10Nuria) >My understanding is that we are talking about extremely high volume access here Does not seem that would be the case since the paid API is restricted to a few clients and...
[05:51:59] <elukey>	 nuria: hola! :)
[05:52:39] <wikibugs>	 10Quarry, 10Data-Services, 10cloud-services-team (Kanban): Quarry or the Analytics wikireplicas role creates lots of InnoDB Purge Lag - https://phabricator.wikimedia.org/T251719 (10Marostegui) 05Open→03Resolved a:03Marostegui I am going to close this as resolved, as looks like that placing labsdb1010 a...
[06:00:20] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10ArielGlenn) @Nuria I was referring to high volume access to the API. See  https://lists.wikimedia.org/pipermail/wikimedia-l/2020-June/095008.html for the broader context.
[06:15:51] <wikibugs>	 10Analytics-Radar, 10Performance-Team: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 (10Gilles) Sure, but if this happens on a weekly basis now it's disruptive. The users with the broken browsers may continue browsing the site and generate those records every time they get sample...
[06:48:35] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10Kelson) FYI: Because it seems there is a knowledge/communication gap about openZIM/Kiwix dumping solution, a Tech talk is currently being planned (probably in August) https://pha...
[08:08:39] <joal>	 Hi team
[08:17:06] <elukey>	 bonjour!
[08:24:12] <wikibugs>	 10Analytics-Radar, 10Fundraising-Backlog, 10WMDE-Analytics-Engineering, 10WMDE-FUN-Team, 10WMDE-Fundraising-Tech: Find a better way for WMDE to get impression counts for their banners - https://phabricator.wikimedia.org/T243092 (10kai.nissen) > We can provide realtime data with a bit of work for the even...
[08:51:02] <joal>	 djellel_: your spark job eats 75% of cluster resources (3TB of ram) - Can you do somehting please?
[08:51:09] <wikibugs>	 10Analytics-Radar, 10CPT Initiatives (Revision Storage Schema Improvements), 10Epic, 10MW-1.35-notes (1.35.0-wmf.32; 2020-05-12), 10Technical-Debt: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (10Marostegui)
[08:54:28] <djellel_>	 joal: ooh, that much? ok, I wait a bit to finish, and check my config.
[08:55:00] <joal>	 djellel_: you can monitor your jobs at https://yarn.wikimedia.org/cluster/scheduler
[08:56:02] <joal>	 driver memory 40G, executor memory 30G, no limit over dynamic-allocation executors
[09:09:24] <elukey>	 djellel_: hi! This seems to be not the first time that Joseph tells you to verify your config, we are usually very open to follow up with whoever needs to use the cluster but we don't particularly love to keep repeating the same thing over and over.. When you run spark jobs, can you please check resource usage via Yarn?
[09:10:01] <elukey>	 There are very important jobs running daily, plus other people that need to use the cluster
[09:10:43] <elukey>	 I know that you didn't make it on purpose to fill up the cluster, just try to check more often resource usage in the future :)
[09:11:26] <icinga-wm>	 PROBLEM - Check the last execution of refinery-druid-drop-public-snapshots on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit refinery-druid-drop-public-snapshots https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:11:40] <joal>	 elukey: --^ :(
[09:16:02] <elukey>	 ah! I know why
[09:16:13] <elukey>	 it is the great firewall
[09:16:53] <elukey>	 I didn't add the new hosts, they are not inside the analytics vlan
[09:16:56] <elukey>	 just realized
[09:17:43] <joal>	 Ah! similar issue that the kafka one IIUC
[09:20:26] <elukey>	 I didn't check the logs but I am pretty sure it is the issue
[09:21:23] <joal>	 right elukey - Now ou can guess my next question I guess - Where should we add some doc to try and make sure we don't forget network next time we add machines ;)
[09:22:46] <elukey>	 joal: we can add documentation but it is easy to forget, the vlan firewall is not in every host and difficult to handle
[09:23:41] <joal>	 elukey: I definitely understand the difficulty in term of non-standard - Should we create a page about adding new hosts, with some kind of checklist?
[09:24:06] <elukey>	 joal: yes I can try
[09:24:41] <joal>	 elukey: I'm by no mean saying you've done right, just trying to push to more formal approach for next time (and while saying that, I realize it's me the no-doc-man asking you the doc-man to do more doc... / me ashame)
[09:24:53] <joal>	 you'nve NOT done right sorry
[09:26:09] <joal>	 pfff - elukey I'm gonna stop - you have my point and my typing is so wrong I'm gonna make you unhappy for wrong reasons - Sorry for that elukey  :(
[09:26:34] <elukey>	 yes yes I didn't get it in that way, I was trying to understand how to do it, usually I lay down the upgrade/bootstrap/etc.. process in the task and I don't check specific docs (since every cluster/host is basically different), but I could collect a sort of "wisdom" page broken down by cluster
[09:27:22] <elukey>	 please push for more technical formalities and quality :)
[09:27:22] <joal>	 elukey: a table, with stuff to do as rows, and cluster as column, and comments where needed?
[09:27:46] <elukey>	 yes something like that, it could also be useful for other things more sw related
[09:27:47] <joal>	 elukey: a grid about the grids!
[09:27:56] <elukey>	 (like what to check about druid, kafka, etc..)
[09:28:00] <elukey>	 Archiva :P
[09:28:08] <joal>	  /o\
[09:28:12] <joal>	 :)
[09:28:43] <djellel_>	 joal: argh, I killed it (199 + 1) / 200 for a 30min. :/
[09:30:51] <djellel_>	 joal: elukey: I am debugging a nasty groupby issue that won't scale to the data, I can't necessarily keep looking at the cluster, although I do from time to time. but, please, tell me what I need to set. I had --num-executors=50 --conf spark.dynamicAllocation.maxExecutors=150
[09:31:38] <joal>	 djellel_: max-executor=150 with executor-memory=30g means 4.5Tb, more than the whole cluster
[09:31:40] <djellel_>	 as for the memory, the groupby does a collect_list()... so I had to try increasing the memory of executors
[09:33:19] <djellel_>	 joal: gotcha, I usually allocate just a few gigs for the executors, but yeah, my groupby won't work :(
[09:34:02] <joal>	 djellel_: there are various ways to cheat when scaling is difficult - I can help if you ask :)
[09:34:07] <elukey>	 djellel_: sure but if you are debugging something (so you are not sure about a spark job etc..) it is definitely something good to check every 10/15 mins the status of the job in yarn. It is a refresh of the page, it cost you few seconds and other people will not be impacted
[09:34:38] <elukey>	 you were using 3tb of ram :D
[09:34:44] <joal>	 djellel_: what I'm asking on your side is to make the small math of how much resource your jobs might take, and not take more than 1/2 the cluster (even that is very big)
[09:35:05] <joal>	 djellel_: please
[09:36:22] <joal>	 djellel_: also something to know - when in dynamic allocation mode (on by default), there is no need to specify `--num-executors=50` as it won't be taken into consideration (and might mislead you in thinking you have set limits)
[09:36:43] <djellel_>	 yep, and sorry about that. it was simple command recall with memory pimp. Thanks for your understanding, guys
[09:37:23] <djellel_>	 I've battling with this over the whole weekend ..
[09:37:27] <elukey>	 !log restart refinery-druid-drop-public-snapshots.service after change in vlan firewall rules (added druid100[7,8] to term druid)
[09:37:29] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:38:19] <djellel_>	 ok, I need to restart again my job. is this reasonable: --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 20G  ?
[09:39:26] <joal>	 djellel_: that's good yes - thanks
[09:42:59] <icinga-wm>	 RECOVERY - Check the last execution of refinery-druid-drop-public-snapshots on an-launcher1001 is OK: OK: Status of the systemd unit refinery-druid-drop-public-snapshots https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:44:02] <elukey>	 joal: do you want to know something nice?
[09:44:15] <joal>	 please elukey!
[09:44:40] <elukey>	 joal: did you see what I just restarted?
[09:44:52] <joal>	 Yes I have!
[09:45:01] <elukey>	 ok something is missing then :)
[09:45:04] <elukey>	 namely alarms
[09:45:06] <elukey>	 from aqs
[09:45:09] <joal>	 elukey: it has run, deleted stuff and not broke anything?
[09:45:17] <elukey>	 seems so!
[09:45:20] <joal>	 \o/
[09:45:51] <elukey>	 the metrics showed an increase in latencies, but then from the on host logs the brokers recovered
[09:46:21] <joal>	 This is awesome - confirmation of the better understanding of the HTTP threading model!
[09:46:35] <joal>	 elukey: --^
[09:46:53] <elukey>	 \o/
[10:13:48] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Dropping data from druid takes down aqs hosts - https://phabricator.wikimedia.org/T226035 (10elukey) Today there was the first drop event after the new settings, and no AQS alarm was raised!
[10:16:37] * elukey early lunch (before interview)
[10:18:02] <wikibugs>	 (03CR) 10Joal: [C: 03+2] "I have double checked code changes in refinery-hive between versions 0.0.100 and 0.0.115 don't impact the updated jobs." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605168 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[10:20:16] <wikibugs>	 (03CR) 10Joal: [C: 03+2] "Triple checked on refinery-job code" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605170 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[10:24:12] <wikibugs>	 (03CR) 10Joal: [C: 03+2] "One change in the code between versions, but should be no-op (package update of needed class). Looks good." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605171 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[10:32:04] <wikibugs>	 (03CR) 10Joal: [C: 03+2] "Looks good :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605173 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[12:09:56] <wikibugs>	 10Analytics-Radar, 10Fundraising-Backlog, 10WMDE-Analytics-Engineering, 10WMDE-FUN-Team, 10WMDE-Fundraising-Tech: Find a better way for WMDE to get impression counts for their banners - https://phabricator.wikimedia.org/T243092 (10Pcoombe) The current setup simply returns impressions for any campaign con...
[12:38:12] <wikibugs>	 (03CR) 10Joal: [C: 03+2] "Big jump here - Some changes, mostly without impact: additional fields returned from GetMediaFilePropertiesUDF not used in HQL, and update" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605172 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[12:38:40] <wikibugs>	 (03CR) 10Joal: [C: 03+2] "Other patches checked -  ready for me :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605174 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[12:44:25] <elukey>	 thanks a lot joal 
[12:44:46] <joal>	 <3 Thank you for the forced cleaning - I have been asking for this for ages !
[12:46:49] <elukey>	 is it super tedious to do the checks above? If not we could establish a workflow to upgrade oozie coordinator when they use "stale" artifacts
[12:46:57] <elukey>	 with "stale" == some convention
[12:47:45] <joal>	 elukey: I'd actually prefer it as you describe: when we change some code, places using that code should be changed/updated as well
[12:48:14] <joal>	 elukey: that way, we could keep only a single (or almost) version of refinery-source in refinery
[12:48:25] * joal dreams of deploys taking a few seconds
[12:49:43] <wikibugs>	 10Analytics, 10Analytics-Kanban: Unique devices, retrofit with bot detection code - https://phabricator.wikimedia.org/T250744 (10JAllemandou) Findings for a day of per-domain uniques, considering domain+country:  - No effect of removing bots traffic on offset, as offset is about actors having made a single cal...
[12:53:06] <volans>	 elukey, ottomata: FYI we've upgraded cumin2001 to buster. So far everything seems to work fine, but given the upgrade would be great if you could test some of your cookbooks to make sure everything works fine.
[12:54:25] <volans>	 Let me/moritz know if you encounter any issue, and feel free to use T245114 for any report.
[12:54:26] <stashbot>	 T245114: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114
[12:56:39] <elukey>	 volans: I don't have anything scheduled to restart but I'll try to do something for hadoop test
[12:57:20] <volans>	 thanks, not sure if a dru-run would show enough in your specific cases
[12:59:03] <mforns>	 hello teammm
[12:59:13] <joal>	 Hi mforns 
[13:00:12] <elukey>	 volans: nono I meant running a cookbook for the hadoop test cluster :D
[13:01:17] <volans>	 elukey: yep, I understood, I meant maybe running in dry-run something else too :)
[13:02:22] <elukey>	 volans: see you SRE people always looking for JVMs to restart, not sure what you have against the Analytics team :D :D :D
[13:02:29] * elukey runs before Moritz reads
[13:03:31] <volans>	 ahahah
[13:08:26] * elukey running errand for a bit
[13:19:58] <wikibugs>	 (03PS2) 10Mforns: POC Airflow Refine [analytics/refinery] - 10https://gerrit.wikimedia.org/r/597623 (https://phabricator.wikimedia.org/T241246)
[13:43:22] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] oozie: move cassandra config to refinery-cassandra-0.0.115 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605167 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[13:43:33] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] oozie: move cassandra config to refinery-cassandra-0.0.115 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605167 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[13:43:41] <joal>	 elukey: yeah I was about to say - I didn't merge
[13:43:44] <wikibugs>	 (03CR) 10Elukey: [V: 03+2] oozie: move cassandra config to use refinery-hive-0.0.115.jar [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605168 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[13:44:05] <wikibugs>	 (03CR) 10Elukey: [V: 03+2] oozie: update refinery-job jars for mediawiki history coordinators [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605170 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[13:44:12] <wikibugs>	 (03CR) 10Elukey: [V: 03+2] oozie: update refinery-hive jar for unique_devices per project fam coordinators [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605171 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[13:44:20] <wikibugs>	 (03CR) 10Elukey: [V: 03+2] oozie: update refinery-hive jar in insert_hourly_mediacounts.hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605172 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[13:44:28] <wikibugs>	 (03CR) 10Elukey: [V: 03+2] oozie: move remaining coordinator properties to use refinery-job 0.0.115 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605173 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[13:44:37] <wikibugs>	 (03CR) 10Elukey: [V: 03+2] Remove artifacts that are not needed anymore. [analytics/refinery] - 10https://gerrit.wikimedia.org/r/605174 (https://phabricator.wikimedia.org/T254849) (owner: 10Elukey)
[13:45:05] <elukey>	 joal: all done, train should be unblocked now, I am filling the etherpad
[13:45:16] <joal>	 ack elukey - thanks
[14:02:34] <elukey>	 joal: qq if you have a moment - are all the cassandra coords taken care by a kill/start of the cassandra bundle? (at the beginning of the month)
[14:02:40] <elukey>	 I completely forgot the procedure
[14:03:03] <joal>	 elukey: yes, all coords can be restarted through bundle
[14:03:25] <joal>	 elukey: I however don't do it like that anymore for the burden to have to wait and not being able to fast deploy
[14:04:59] <elukey>	 joal: ah so do we roll restart all coords separately?
[14:05:12] <joal>	 elukey: that's what I have done lately - not fun :(
[14:06:23] <elukey>	 :(
[14:06:46] <milimetric>	 elukey: I'll hunt down the raw UA string for that Nav Timing malformed event problem
[14:06:55] <milimetric>	 (my ops week now)
[14:08:51] <elukey>	 ack!
[14:23:41] <wikibugs>	 10Analytics-Radar, 10Performance-Team: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 (10Nuria) >Is there a way to get the unprocessed UA string for a particular record?  You can look for this particular record (not at all formatted) in the webrequest table, it will appear as a re...
[14:24:59] <wikibugs>	 10Analytics: Investigate showing realtime the eventlogging banner stream (currently sampled at 1%) - https://phabricator.wikimedia.org/T255446 (10Nuria)
[14:25:45] <wikibugs>	 10Analytics-Radar, 10Fundraising-Backlog, 10WMDE-Analytics-Engineering, 10WMDE-FUN-Team, 10WMDE-Fundraising-Tech: Find a better way for WMDE to get impression counts for their banners - https://phabricator.wikimedia.org/T243092 (10Nuria) I have filed a ticket , please be aware that we might not get to it...
[14:44:09] <wikibugs>	 10Analytics-Radar, 10Performance-Team: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 (10Milimetric) Yes, but you have to go back in the pipeline to the `eventlogging-client-side` topic.  So you can use kafkacat.  I don't know if there's a way to get a specific date range (I thoug...
[15:00:45] <nuria>	 milimetric; standup
[15:34:34] <wikibugs>	 10Analytics, 10Analytics-General-or-Unknown, 10AbuseFilter: Provide regular cross-wiki reports on abuse filters actions - https://phabricator.wikimedia.org/T44359 (10fdans) 05Open→03Declined RIP wikimetrics
[15:35:57] <wikibugs>	 10Analytics-Radar, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Product-Analytics (Kanban): Creation of a new POSIX group and system user for the Product Analytics team - https://phabricator.wikimedia.org/T255039 (10fdans)
[15:37:10] <wikibugs>	 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Analytics: Augment Hive event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320 (10fdans) there's already code for this in pageviews (the UDF that determines the project given a webhost).
[15:38:01] <wikibugs>	 10Analytics, 10Analytics-Kanban: Purge old files on Archiva to free some space - https://phabricator.wikimedia.org/T254849 (10fdans) p:05Triage→03High
[15:43:44] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10Product-Analytics: Request admin access to Superset - https://phabricator.wikimedia.org/T255207 (10fdans) We're looking into the possible creation of an intermediate type of user between alpha and admin that applies to this case.
[15:44:48] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10Product-Analytics: Request admin access to Superset - https://phabricator.wikimedia.org/T255207 (10Milimetric) Theory to follow up on in a spike: it could be possible to write a script that inserts our custom roles into the db and survives schema changes, by taking into ac...
[15:46:17] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Product-Analytics: 'namespace_is_content' column in pageview data returns 1, 0 and NULL as booleans in Superset/Turnilo - https://phabricator.wikimedia.org/T255222 (10fdans)
[15:48:35] <wikibugs>	 10Analytics, 10Analytics-Cluster: Create a small VM to host jobs/system timers currently running on stat1007 - https://phabricator.wikimedia.org/T255265 (10fdans)
[15:49:10] <wikibugs>	 10Analytics, 10Analytics-Cluster: Separate Report Updater Jobs to dedicated VM - https://phabricator.wikimedia.org/T255266 (10fdans)
[15:53:00] <wikibugs>	 10Analytics: Investigate showing realtime the eventlogging banner stream (currently sampled at 1%) - https://phabricator.wikimedia.org/T255446 (10Nuria)
[15:53:33] <wikibugs>	 10Analytics: Investigate showing realtime the eventlogging banner stream (currently sampled at 1%) - https://phabricator.wikimedia.org/T255446 (10fdans) p:05Triage→03High
[15:53:44] <wikibugs>	 10Analytics, 10Analytics-Kanban: Investigate showing realtime the eventlogging banner stream (currently sampled at 1%) - https://phabricator.wikimedia.org/T255446 (10fdans)
[16:14:18] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Superset aggregation across edit tags uses all tags - https://phabricator.wikimedia.org/T243552 (10fdans) 05Open→03Resolved According to the way edits_hourly is defined, the data as it's presented is correct. This seems more of a visualization problem...
[16:18:35] <wikibugs>	 10Analytics: Investigate why netflow hive_to_druid job is so slow - https://phabricator.wikimedia.org/T254383 (10fdans) 05Open→03Resolved
[16:20:18] <wikibugs>	 10Analytics, 10Analytics-Kanban: hdfs-rsync of mediawiki history dumps fails due to source not present (yet) - https://phabricator.wikimedia.org/T251858 (10fdans) 05Open→03Resolved
[16:23:02] <joal>	 elukey: interesting finding for me - hdfs fsck doesn't work for me (not hdfs user on an-coord1001 :(
[16:23:17] <wikibugs>	 10Analytics, 10Analytics-Kanban: Refine should DROP IF EXISTS before ADD PARTITION - https://phabricator.wikimedia.org/T246235 (10fdans) https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/602463/
[16:27:27] <wikibugs>	 10Analytics: Revise wiki scoop list from labs once a quarter - https://phabricator.wikimedia.org/T239136 (10fdans) 05Open→03Declined
[16:27:47] <elukey>	 joal: sorry I didn't get it
[16:28:14] <elukey>	 do you need to run hdfs fsck? 
[16:28:31] <wikibugs>	 10Analytics: Look at view stats on our docs from time to time - https://phabricator.wikimedia.org/T240894 (10fdans) 05Open→03Invalid
[16:28:33] <joal>	 np elukey - indeed I'm trying thatn
[16:29:03] <elukey>	 joal: IIRC you need to sudo -u hdfs kerberos-blabla no?
[16:29:14] <joal>	 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs fsck /user/joal/ -files -blocks
[16:29:23] <joal>	 On an-coord1001
[16:30:22] <elukey>	 yes exactly
[16:30:33] <joal>	 elukey: that's what fails for me :)
[16:30:58] <elukey>	 ah ok this was not clear from above :)
[16:31:00] <joal>	 elukey: Exception in thread "main" javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
[16:31:01] <elukey>	 what is the error?
[16:32:11] <elukey>	 ok now it is news to me that hdfs fsck uses TLS
[16:32:42] <joal>	 so is it for me elukey :S
[16:32:46] <joal>	 elukey: this can wait tomorr
[16:37:56] <joal>	 elukey: I'm gonna drop for diner, will be back in a bit
[16:38:39] <elukey>	 o/
[16:42:08] <wikibugs>	 10Analytics, 10Analytics-Kanban: Unique devices, retrofit with bot detection code - https://phabricator.wikimedia.org/T250744 (10Nuria) i see, i think by virtue of accessing a smaller set of data (as data labeled as "user" is now smaller) the unique jobs might be -in theory- a bit faster already.  Funny, i thi...
[16:45:42] <wikibugs>	 10Analytics, 10Analytics-Kanban: Unique devices, retrofit with bot detection code - https://phabricator.wikimedia.org/T250744 (10Nuria) >Findings for a day of per-domain uniques, considering domain+country:   Are these findings for en.wikipedia?
[17:02:06] <elukey>	 joal: so I checked the TLS cert expiry etc.. and they are not expired, especially the namenode's one (my heart stopped for a second)
[17:02:27] <elukey>	 I think it must be the truststore with the self-signed CA cert not picked up
[17:02:29] <elukey>	 but really weird :(
[17:17:33] <wikibugs>	 10Analytics, 10Cloud-VPS, 10Puppet: Puppet failing on wikistats.analytics.eqiad.wmflabs: /usr/local/sbin/x509-bundle error - https://phabricator.wikimedia.org/T255464 (10bd808)
[17:20:53] * elukey off!
[17:26:39] <milimetric>	 mforns: fyi I'm looking at why the history dumps didn't make it to the public server (just in case you saw the email and were going to respond)
[17:27:03] <mforns>	 milimetric: yes I saw, I had put that in my todo list, but not yet looked
[17:35:10] <milimetric>	 I'm having the hardest time remembering/finding how we sync from /wmf/data/archive to the dumps servers
[17:40:46] <mforns>	 milimetric: it's in puppet, looking
[17:42:25] <mforns>	 milimetric: modules/dumps/manifests/web/fetches/stats.pp
[17:42:28] <mforns>	 I think...
[17:43:08] <milimetric>	 yep
[17:43:25] <milimetric>	 it doesn't show up anywhere on wikitech I'll try to find a good place for it
[17:43:28] <milimetric>	 thanks mforns 
[17:47:14] <wikibugs>	 10Analytics, 10Analytics-Kanban: Unique devices, retrofit with bot detection code - https://phabricator.wikimedia.org/T250744 (10JAllemandou) @Nuria : we change from user to bots on pageview table only, not webrequest. Then uniques is being computed with webrequest data as various PII fields are needed for fin...
[17:47:35] <joal>	 nuria: I just commented on the ticket - Do you wish to talk a minute? --^
[17:49:43] <wikibugs>	 10Analytics, 10Analytics-Kanban: Unique devices, retrofit with bot detection code - https://phabricator.wikimedia.org/T250744 (10Nuria) >we change from user to bots on pageview table only, not webrequest.  Ah right, we need to "join" to reduce the data we are combing through, yes.
[17:49:48] <nuria>	 joal: we can talk in bc?
[17:49:52] <joal>	 sure nuria 
[17:56:15] <wikibugs>	 (03PS2) 10Neil P. Quinn-WMF: Whitelist fields in the KaiOSAppFirstRun data stream [analytics/refinery] - 10https://gerrit.wikimedia.org/r/604846
[18:17:24] <wikibugs>	 10Analytics, 10Analytics-Kanban: Create intermediate dataset: pageview with actor information - https://phabricator.wikimedia.org/T255467 (10Nuria)
[18:19:28] <wikibugs>	 10Analytics, 10Analytics-Kanban: Create intermediate dataset: pageview with actor information - https://phabricator.wikimedia.org/T255467 (10Nuria)
[18:30:22] <wikibugs>	 (03CR) 10Nuria: [C: 03+1] Whitelist fields in the KaiOSAppFirstRun data stream [analytics/refinery] - 10https://gerrit.wikimedia.org/r/604846 (owner: 10Neil P. Quinn-WMF)
[18:30:28] <wikibugs>	 (03CR) 10Nuria: [C: 03+2] Whitelist fields in the KaiOSAppFirstRun data stream [analytics/refinery] - 10https://gerrit.wikimedia.org/r/604846 (owner: 10Neil P. Quinn-WMF)
[20:22:07] <wikibugs>	 10Analytics-Radar, 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10RobH) @spatton does have a wikitech account (I checked ldap) but this still needs feedback if https://turnilo.wikimedia.org will meet their needs.   There i...
[20:27:49] <wikibugs>	 10Analytics-Radar, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Product-Analytics (Kanban): Creation of a new POSIX group and system user for the Product Analytics team - https://phabricator.wikimedia.org/T255039 (10RobH) Since this is setting a new user group for sudo as that user(s), shou...
[20:36:00] <wikibugs>	 10Analytics, 10Analytics-Kanban: mediawiki history dumps sync not working - https://phabricator.wikimedia.org/T255485 (10Milimetric)
[20:36:11] <wikibugs>	 10Analytics, 10Analytics-Kanban: mediawiki history dumps sync not working - https://phabricator.wikimedia.org/T255485 (10Milimetric) p:05Triage→03High
[20:47:31] <wikibugs>	 10Analytics, 10Product-Analytics: Can't publish my draft dashboard on superset - https://phabricator.wikimedia.org/T248904 (10Milimetric) 05Open→03Resolved a:03elukey this was fixed with the upgrade to version 0.36  there's no mention of it on their issue tracker or changelog, but I'm glad it's fixed :)
[20:58:51] <wikibugs>	 10Analytics: Investigate tools.wmflabs.org to toolforge.org migration - https://phabricator.wikimedia.org/T250116 (10Milimetric) 05Open→03Resolved Here's what I checked:  * quarry works (but has some weird access issues that I don't understand) * pageviews tool works great, even passes query parameters from...
[21:04:11] <wikibugs>	 10Analytics, 10Analytics-Kanban: Spike, see how easy/hard is to scoop all tables from Eventlogging log database - https://phabricator.wikimedia.org/T250709 (10Milimetric) >>! In T250709#6212980, @Nuria wrote: > Some comments: >  > We do not need to scoop the following tables as that data exists on events datab...
[21:39:27] <wikibugs>	 10Analytics, 10Growth-Team, 10Product-Analytics: Newcomer tasks: update schema whitelist for Guidance - https://phabricator.wikimedia.org/T255501 (10nettrom_WMF)
[21:39:56] <wikibugs>	 10Analytics, 10Growth-Team, 10Product-Analytics: Newcomer tasks: update schema whitelist for Guidance - https://phabricator.wikimedia.org/T255501 (10nettrom_WMF)
[22:03:10] <wikibugs>	 10Analytics-Kanban, 10Analytics-Radar, 10Privacy Engineering, 10Privacy, and 3 others: Identify pending analyses needing access to data older than 90 days - https://phabricator.wikimedia.org/T250857 (10Krinkle) >>! In T250857#6138291, @Mayakp.wiki wrote: > Just been informed that DiscussionTools data will...
[22:09:07] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10RBrounley_WMF) Hey all -  Hugely appreciate the interest and thank you for providing some of the context for this project @Seddon. There are a lot of questions to work through an...
[22:15:45] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10ArielGlenn) Hey @RBrounley_WMF just to clarify, when I was talking about dump reuse, I meant your dumps: using the previous run so you don't request all the revs from restbase ev...
[22:23:18] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10RBrounley_WMF) @ArielGlenn - oh great, yeah I misunderstood that. So the first run is obviously expensive on RESTBase to grab all of the pages but we're thinking about listening...
[22:49:10] <wikibugs>	 10Analytics-Radar, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10CDanis) >>! In T254275#6226129, @RBrounley_WMF wrote: > We’ve done some talking with RESTBase engineers and since most pages everything is cached already, it seems like the infra...