[01:19:01] 10Analytics-Radar, 10Operations, 10Wikimedia-Logstash, 10observability, 10Performance-Team (Radar): Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10Krinkle) I'm confused.. I thought we were already on the Kafka pipeline with udp2l... [06:12:27] PROBLEM - Check the last execution of refinery-sqoop-whole-mediawiki on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:30:23] ouch [06:31:36] seems to be related to shnwiktionary [06:51:12] 10Analytics, 10Product-Analytics: Streamline Superset signup and authentication - https://phabricator.wikimedia.org/T203132 (10elukey) >>! In T203132#6271647, @Ottomata wrote: > It would also be nice if all WMF auth could use one, but I don't think it is possible to do. IIRC, the reason we sometimes use the s... [06:54:13] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Move Archiva to Debian Buster - https://phabricator.wikimedia.org/T252767 (10elukey) We moved refinery to use specific repos (mirror-cloudera/mirror-spark/mirror-maven-central/analytics-old-uploads) and the build went fine, everythin... [06:56:02] 10Analytics, 10Analytics-Kanban: Purge old files on Archiva to free some space - https://phabricator.wikimedia.org/T254849 (10elukey) This is basically done. The jobs to be restarted are listed in https://etherpad.wikimedia.org/p/analytics-weekly-train, we should be able to do it gracefully during the next weeks. [06:59:02] meh [07:00:39] elukey: indeed, shnwiktionary seems not present - I suggest manually adding success-files to the folders - does that work for you elukey ? [07:01:21] also elukey, this means the doc about checking for database availability in wikitech opsweek is not correct - will find the difference and update [07:03:46] elukey: Ah!
also, sqoop-mediawiki failure means we don't proceed with sqoop-mediawiki-production - We'll need to start that manually :S [07:05:22] joal: yes makes sense! +1 [07:05:32] ack elukey :) [07:06:49] !log Manually add success-files to sqooped folders (archive, change_tag, change_tag_def, content_models, imagelinks, ipblocks, ipblocks_restrictions, logging, page, pagelinks, page_restrictions, redirect, revision, slots, slot_roles, user, user_groups, wbc_entity_usage) [07:08:05] ok so the "manual" part is launching /usr/local/bin/refinery-sqoop-mediawiki-production [07:08:17] correct elukey [07:09:13] we can do it via tmux easily I think, not perfect but who cares :) [07:09:35] I think that's the easiest elukey - should not be very long (~6h I guess) [07:10:18] elukey: success files manually added :) [07:11:44] super, are you taking care of the tmux or do you prefer me to do it? [07:20:42] !log execute systemctl reset-failed refinery-sqoop-whole-mediawiki.service to clear our alarms on launcher1002 [07:20:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:22:28] sorry elukey - went for coffee [07:22:33] Thanks for tmux :) [07:25:01] all right starting then [07:26:02] !log start a tmux on an-launcher1002 with 'sudo -u analytics /usr/local/bin/kerberos-run-command analytics /usr/local/bin/refinery-sqoop-mediawiki-production' [07:26:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:28:13] RECOVERY - Check the last execution of refinery-sqoop-whole-mediawiki on an-launcher1002 is OK: OK: Status of the systemd unit refinery-sqoop-whole-mediawiki https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:29:59] joal: just sent the email to alerts@ explaining what we did [07:52:14] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (10elukey) [09:17:28] 10Analytics, 10Operations,
10Traffic: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10ema) [09:19:12] 10Analytics, 10Operations, 10Traffic: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10ema) @fgiunchedi: the `puppetmaster` module still has some ganglia-related things such as `prometheus-ganglia-gen`. Is that still needed? [09:45:35] elukey: forgive me if I'm being silly, but I haven't touched stat1007 in a couple weeks... did anything change with kerberos-run-command with the analytics user in that host? [09:45:46] * fdans scans frantically his inbox for an email he's missed [09:46:03] fdans: yep, it is only on launcher1002 now [09:46:45] elukey: yes, of course, this is the thing you talked about several times last week and that my brain didn't process [09:46:54] sorry [09:47:44] fdans: nono please no problem, on the contrary, if it is a hassle to have the user only on launcher1002 we can try to fix it [09:48:06] the idea was to segregate our credentials on a host that we use exclusively [09:49:13] yeayea that makes sense [09:50:35] ("whatever Luca I know you are paranoid don't need to explain that") :D [09:52:27] elukey: sooo within that host it's the same deal, sudo -u analytics kerberos-run-command analytics [...] [09:52:31] ? [09:52:58] yep [09:54:59] ah yeayea I was doing two things, one authed and the other not [09:55:01] my bad [10:00:37] 10Analytics-Radar, 10Operations, 10Wikimedia-Logstash, 10observability, 10Performance-Team (Radar): Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10fgiunchedi) We are on the Kafka pipeline for MW logs that were sent to logstash ov... [10:01:54] 10Analytics, 10Operations, 10Traffic: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10fgiunchedi) >>! 
In T253555#6273943, @ema wrote: > @fgiunchedi: the `puppetmaster` module still has some ganglia-related things such as `prometheus-ganglia-gen`. Is that still needed?... [10:13:45] email to alerts@ explaining what we did [10:13:50] oops sorry [10:15:51] wow I said about ~6h for sqoop-prod, it actually took ~2h30 [10:16:54] great - something else interesting: shnwiktionary tables are present in analytics-prod replicas, but not yet in labs - Need to add a test about that in opsweek [10:31:47] going afk for lunch [10:33:17] ack elukey :) [10:44:54] doc updated here: the list should be long [10:44:56] oops [10:45:04] https://wikitech.wikimedia.org/wiki/Analytics/Ops_week#Adding_new_wikis_to_the_sqoop_list [10:45:13] * joal has copy/paste issues today :) [12:03:50] 10Analytics, 10DBA: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10jcrespo) [12:04:54] 10Analytics, 10DBA: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10jcrespo) On replication start, instance crashed again- probably there is data/fs corruption. [12:07:40] 10Analytics, 10DBA: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10jcrespo) Same issues as T249188 ? [12:27:35] 10Analytics, 10DBA: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10Kormat) This host was reimaged to buster recently (2020-06-22) as part of T254870, and the symptoms do sound very like https://jira.mariadb.org/browse/MDEV-22373, with the significant difference that this... 
[12:37:19] 10Analytics, 10Operations, 10Traffic, 10Patch-For-Review: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10ema) [12:37:26] 10Analytics, 10Operations, 10Traffic, 10Patch-For-Review: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10ema) 05Open→03Resolved [12:40:15] the issue with the denormalize job is very bizarre - It is a supposedly known Spark issue for some version of apache commons-lang, but the needed version is correctly referenced in refinery-job pom.xml :( [12:40:54] Will try a rerun, just to see if this is a heisenbug (I don't think it is) - Seems related to jar building dependencies - archiva is probably not yet behind us :( [12:41:23] !log retry mediawiki-history-denormalize-wf-2020-06 [12:41:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:47:45] joal: what is the error? [12:47:58] java.lang.IllegalArgumentException: Illegal pattern component: XXX [12:47:58] at org.apache.commons.lang3.time.FastDatePrinter.parsePattern(FastDatePrinter.java:282) [12:49:21] The low-tech solution is to update the timestamp def - but this is like not cool [12:53:51] 10Analytics, 10Product-Analytics: Streamline Superset signup and authentication - https://phabricator.wikimedia.org/T203132 (10Ottomata) Makes sense to me, if we can be as consistent as possible, sure! [12:55:22] joal: is it something like a new version of a lib not liking a parameter or something else entirely? [12:56:36] ah ok I see, this is a known bug of a previous version of commons-lang, that should be patched but that is bitting us [12:56:58] *biting [13:07:41] yes elukey - normally the explicit dep on commons-lang3 should solve it [13:11:57] joal: what is the jar yielding this weirdness? maybe the mirrored repo was somehow masking/solving the problem for some reason [13:12:09] hello!
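The stack trace above pins the failure to commons-lang3's FastDatePrinter: the `XXX` (ISO-8601 time zone) pattern token is only understood by commons-lang3 3.5 (the version pinned in the pom later in the log), so an older jar that ends up on the classpath throws exactly this IllegalArgumentException. A minimal illustrative model of that version gate, in Python for clarity; the timestamp pattern and the 3.5 cutoff are assumptions inferred from this log, not taken from the commons-lang3 source:

```python
# Illustrative model (not the actual commons-lang3 code): the "XXX"
# ISO-8601 time zone token is assumed to be understood only by
# commons-lang3 >= 3.5; older versions raise IllegalArgumentException.

def parse_version(v):
    """Turn '3.3.2' into a comparable tuple like (3, 3, 2)."""
    return tuple(int(part) for part in v.split("."))

def supports_pattern(pattern, commons_lang3_version, min_version_for_xxx="3.5"):
    """Return True if this commons-lang3 version can parse the pattern."""
    if "XXX" in pattern and parse_version(commons_lang3_version) < parse_version(min_version_for_xxx):
        return False
    return True

timestamp_pattern = "yyyy-MM-dd'T'HH:mm:ssXXX"  # assumed refinery-style pattern
print(supports_pattern(timestamp_pattern, "3.3.2"))  # stale sharelib jar
print(supports_pattern(timestamp_pattern, "3.5"))    # version pinned in the pom
```

This matches the observed behavior: the job only fails when the resolved jar is older than the pinned 3.5.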
[13:12:19] good morning :) [13:12:21] elukey: I assume the mirror repo did something, but can't say what exactly :) [13:12:21] joal did we get a successful deploy? If so i'll merge the refine domain filtering stuff [13:12:24] Hi ottomata :) [13:12:37] in puppet [13:12:39] ottomata: successful deploy indeed, but another issue [13:12:53] should I hold off in case we need to revert? [13:12:54] could actually affect refine [13:12:57] oh [13:13:06] I'd be interested to know if it does [13:14:07] hm, ok joal i'll merge and bump jars and watch and see :) [13:14:10] if not we can revert that [13:14:18] ack [13:23:36] ah the bit that interests us is [13:23:39] <dependency> [13:23:45] <groupId> [13:23:47] org.apache.commons</groupId> [13:23:49] correct elukey [13:23:50] <artifactId>commons-lang3</artifactId> [13:23:52] <version>3.5</version> [13:23:55] </dependency> [13:24:49] so if https://yarn.wikimedia.org/proxy/application_1592377297555_79683/ finishes correctly we might be in trouble [13:25:10] elukey: it just failed :) [13:25:22] ahahahah ok [13:25:29] same error [13:25:35] pffff :( [13:25:40] * joal is sad [13:25:44] that is good, it is not a heisenbug [13:25:48] true [13:26:11] I have a feeling - will try the job manually (no oozie) [13:26:14] elukey: --^ [13:26:29] elukey: page-only (the bit that is failing) [13:26:32] what version is the problematic one? 3.4?
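Pinning 3.5 in the pom only controls what gets built; at run time the JVM resolves a class from the first jar on the classpath that provides it, so an older commons-lang3 injected ahead of the application jar (e.g. by the oozie sharelib) can shadow the pinned version. A small Python sketch of that first-jar-wins resolution; the classpath ordering shown is an assumed illustration, not taken from the cluster:

```python
import re

# Illustrative model of JVM classpath resolution: the first jar on the
# classpath that provides a class "wins", so a stale commons-lang3 jar
# can shadow the 3.5 pinned in pom.xml. The ordering below (sharelib
# jars before the application jar) is an assumption for illustration.

def effective_version(classpath, artifact):
    """Return the version of the first jar on the classpath matching artifact."""
    pattern = re.compile(rf"^{re.escape(artifact)}-([\d.]+)\.jar$")
    for jar in classpath:
        m = pattern.match(jar)
        if m:
            return m.group(1)
    return None

classpath = [
    "commons-lang-2.4.jar",     # different artifact, ignored for commons-lang3
    "commons-lang3-3.3.2.jar",  # stale jar injected earlier on the classpath
    "commons-lang3-3.5.jar",    # version pinned by the application, listed later
]
print(effective_version(classpath, "commons-lang3"))
```

Under this model the job sees 3.3.2 even though the build correctly depends on 3.5, which is consistent with the "same error" seen after the redeploy.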
Maybe something is pulling it for some reason [13:26:43] sure [13:26:45] I think spark uses 3.2.5 [13:26:52] sorry - oozie uses 3.2.5 [13:28:13] please do any test, let's try to nail this down :( [13:28:24] sure [13:35:08] elukey: job started, need to drop for errand - will be back in a bit and continue testing [13:38:41] joal: refine seems to work just fine [13:38:51] launching backfill for remaining wdqs query [13:39:23] 10Analytics, 10Analytics-Kanban, 10Event-Platform: Backfill wdqs_external_sparql_query without filtering on meta.domain - https://phabricator.wikimedia.org/T256797 (10Ottomata) Launching backfill for remaining data: ` sudo -u analytics kerberos-run-command analytics /usr/bin/spark2-submit \ --name refine_ev... [13:45:32] db1108 has two mariadb instances now! [13:45:43] analytics_meta and matomo [13:45:53] the next step is to set up replication [13:59:54] as FYI, dbstore1005 is down for maintenance [14:00:21] the database is not replicating correctly, probably something got corrupted, data persistence is working on it [14:10:35] 10Analytics, 10DBA: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10Kormat) p:05Triage→03High a:03Kormat [14:10:43] 10Analytics, 10DBA, 10User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10Kormat) [14:50:51] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (10elukey) Status of the databases: `analytics-meta` has binlog enabled, with ROW format but not gtid. Mariadb version 10.... [14:56:17] ack ottomata - refine works, great :) [14:56:29] elukey: my manual execution worked - the problem must be related to oozie libs [14:57:13] joal: what do you mean with oozie libs precisely?
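One way to sanity-check the wdqs backfill launched above is to look at each output partition's marker files: a partition that was processed but produced no rows contains only the refine flag, with no data files next to it. A hedged sketch of that check; the `_REFINED` flag name and the listings are illustrative assumptions, not the exact refine implementation:

```python
# Sketch: classify a refined partition by its HDFS file listing.
# Assumption: refine drops a "_REFINED" marker in each partition it has
# processed, so "marker but no data files" means refined-but-empty.

def partition_state(files, flag="_REFINED"):
    """Classify a partition directory from its file names."""
    data_files = [f for f in files if not f.startswith("_")]
    if flag in files and data_files:
        return "refined-with-data"
    if flag in files:
        return "refined-but-empty"
    return "not-refined"

print(partition_state(["_REFINED"]))
print(partition_state(["_REFINED", "part-00000.parquet"]))
```

A backfill is only really done once the target partitions move from the empty state to holding data files.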
[14:57:21] (just to focus on what could be wrong) [14:57:48] elukey: I think oozie libs provided to spark contain the faulty version of commons-lang3 [14:59:14] so the ones under /user/oozie/share/lib/ on hdfs? [14:59:35] yessir [14:59:44] ah ok got it now :) [15:00:09] elukey: currently checking :) [15:11:07] ottomata: has the backfilling job started? [15:11:26] ottomata: /wmf/data/event/wdqs_external_sparql_query/datacenter=eqiad/year=2020/month=6/day=1/hour=0 still contains only refined flag [15:13:30] elukey@an-coord1001:~$ ls /mnt/hdfs/user/oozie/share/lib/lib_20200*/spark | grep commons-lang [15:13:33] commons-lang-2.4.jar [15:13:36] commons-lang3-3.3.2.jar [15:13:38] commons-lang-2.4.jar [15:13:41] commons-lang3-3.3.2.jar [15:13:43] joal: it is running now ya [15:13:44] I see old libs only for the "spark" directory [15:13:46] will check it after [15:30:18] 10Analytics, 10Analytics-Cluster, 10ORES, 10Research, 10Scoring-platform-team: Desired packages to be installed/upgraded on the PySpark cluster (jupyterhub) - https://phabricator.wikimedia.org/T249078 (10Ottomata) [15:30:20] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Create anaconda .deb package with stacked conda user envs - https://phabricator.wikimedia.org/T251006 (10Ottomata) [15:30:31] 10Analytics, 10Analytics-SWAP: Support R Kernels by default for all users. - https://phabricator.wikimedia.org/T190453 (10Ottomata) [15:30:35] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Create anaconda .deb package with stacked conda user envs - https://phabricator.wikimedia.org/T251006 (10Ottomata) [15:37:25] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Create anaconda .deb package with stacked conda user envs - https://phabricator.wikimedia.org/T251006 (10Ottomata) @elukey we want to include default packages in our globally installed anaconda distribution that are not included in u... 
[15:42:27] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (10jcrespo) Both servers will need server-id setup (that is why we set it up with the ipv4 integer by default on other host... [16:02:03] ping ottomata ? [16:02:47] BRT [16:17:43] !log rerun mediawiki-history-denormalize-wf-2020-06 after oozie sharelib bump through manual restart [16:17:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:31:07] 10Analytics, 10DBA, 10User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (10Kormat) Data restored from backup, machine is now catching up on s8 replication. There are some extra grants from the backup that should be cleaned up, but otherwise things are in a good... [17:01:55] elukey: denormalize job has passed the failing point and is proceeding normally \o/ [17:02:15] * joal sends a missile to oozie [17:06:48] * elukey dances [17:08:27] joal, elukey NICE JOB [17:19:18] nuria: it was the very technical resolution of turning it off and on again [17:19:39] elukey: oozie lies to us - I dislike that [17:21:17] joal: true, but there is also a detail to keep in mind [17:21:22] Release Date: July 10, 2015 [17:21:27] right [17:21:44] we are running code almost 5y old :D [17:21:51] :) [17:22:16] 10Analytics, 10Operations, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10faidon) [17:22:27] elukey: also, from Marcel's spike, we might be willing to test something other than airflow (maybe argo?) [17:22:39] on this subject! [17:23:02] ottomata: is an-scheduler1001 an acceptable name for the new life of notebook1004? [17:23:17] YEAH!
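jcrespo's note about server-id refers to a common convention for replication setups: derive a unique numeric id from the host's IPv4 address by packing its four octets into a single 32-bit integer, so no two hosts can collide. A quick sketch of that conversion; the example address is made up, not db1108's real IP:

```python
import ipaddress

# Sketch of the convention mentioned above: use the host's IPv4 address,
# read as a 32-bit integer, as the mariadb server-id. The address below
# is a hypothetical example, not a real production host.

def server_id_from_ipv4(addr):
    """Return the IPv4 address packed as a 32-bit integer."""
    return int(ipaddress.IPv4Address(addr))

print(server_id_from_ipv4("10.64.16.10"))
```

Each octet contributes one byte, so `10.64.16.10` becomes `10*2**24 + 64*2**16 + 16*2**8 + 10`.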
[17:23:36] 10Analytics, 10Operations, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10faidon) So - how do we make progress here? Any thoughts on who/how? :) Some of these features could really make a tremendous amount of difference to our network operati... [17:23:39] if we all agree I'll repurpose it tomorrow probably [17:25:04] ack :) [17:28:54] ottomata: Should we invite Chris to the hangtime we're having now? [17:31:02] * elukey off [17:31:07] bye elukey [17:31:12] o/ [17:31:36] Thanks again for fixing :) [17:49:06] (03CR) 10Nuria: Update whitelisting of Growth Team's schemas (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/607615 (https://phabricator.wikimedia.org/T255501) (owner: 10Nettrom) [17:50:53] elukey: +1 like it [18:10:29] hm - another failure :( [18:16:19] Ok I'm gonna manually run the job and investigate why oozie fails tomorrow [18:16:53] !log Launch a manual instance of mediawiki-history-denormalize to release data despite oozie failing [18:17:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:18:08] app-id of the job: https://yarn.wikimedia.org/proxy/application_1592377297555_80681/ [18:32:21] 10Analytics, 10Better Use Of Data, 10Product-Analytics: Bug: 'Include Time' option in table visualization produces "0NaN-NaN-NaN NaN:NaN:NaN" - https://phabricator.wikimedia.org/T256136 (10MBinder_WMF) {F31914465} [18:40:46] 10Analytics, 10Better Use Of Data, 10Product-Analytics: Bug: 'Include Time' option in table visualization produces "0NaN-NaN-NaN NaN:NaN:NaN" - https://phabricator.wikimedia.org/T256136 (10Ottomata) {F31914481} [20:08:41] 10Analytics, 10Jupyter-Hub: PySpark Error in JupyterHub: Python in worker has different version - https://phabricator.wikimedia.org/T256997 (10diego) [20:14:03] 10Analytics, 10Jupyter-Hub: PySpark Error in JupyterHub: Python in worker has different version - 
https://phabricator.wikimedia.org/T256997 (10diego) Additional information, this is the python version (in the driver, I guess): ` > from platform import python_version > print(python_version()) 3.7.3 ` [20:15:16] joal: hiya! safe to close https://phabricator.wikimedia.org/T256515 as resolved, right? [20:29:35] Hey bearloga - you can close it indeed :) [20:30:23] 10Analytics, 10Product-Analytics: New app pageview definition needs to be deployed - https://phabricator.wikimedia.org/T256515 (10mpopov) 05Open→03Resolved mobile-html requests are now counted as pageviews [20:30:25] 10Analytics, 10Product-Analytics, 10Epic: API pageview counts for 'Mobile app' are incorrect since switch to mobile-html - https://phabricator.wikimedia.org/T256508 (10mpopov) [22:26:00] (03CR) 10Nettrom: "tokens are typically per page load, with one exception described here" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/607615 (https://phabricator.wikimedia.org/T255501) (owner: 10Nettrom) [22:35:42] (03CR) 10Nuria: [C: 03+2] Update whitelisting of Growth Team's schemas [analytics/refinery] - 10https://gerrit.wikimedia.org/r/607615 (https://phabricator.wikimedia.org/T255501) (owner: 10Nettrom) [22:35:47] (03CR) 10Nuria: [V: 03+2 C: 03+2] Update whitelisting of Growth Team's schemas [analytics/refinery] - 10https://gerrit.wikimedia.org/r/607615 (https://phabricator.wikimedia.org/T255501) (owner: 10Nettrom) [22:36:09] 10Analytics-Radar, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), 10Multi-Content-Revisions (New Features), 10User-ArielGlenn: MCR: Import all slots from XML dumps - https://phabricator.wikimedia.org/T220525 (10CCicalese_WMF) [23:25:34] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Technical contributors emerging communities metric definition, thick data - https://phabricator.wikimedia.org/T250284 (10jwang) @Nuria , @Bmueller Here are the historical trends of ruwiki, rowiki and svwiki.
You can explore other wikis at dashboard https... [23:36:19] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Technical contributors emerging communities metric definition, thick data - https://phabricator.wikimedia.org/T250284 (10Nuria) I think now we need some wikis that are on the other side of the spectrum in terms of edits/articles and bots to compare findin...