[01:09:24] Analytics, Analytics-Kanban, Product-Analytics, Structured-Data-Backlog, Patch-For-Review: Add image table to monthly sqoop list - https://phabricator.wikimedia.org/T266077 (nettrom_WMF) Thank you for digging into this @JAllemandou ! Since `img_metadata` is a serialized PHP array, I think th...
[04:31:41] Analytics, Analytics-Dashiki: Chart data from analytics.wikimedia.org do not fully specify Windows 7, 8, and 10 versions - https://phabricator.wikimedia.org/T269729 (gh87)
[06:47:55] goood morning
[06:50:49] !log stop timers on an-launcher1002 as prep step to restart hive
[06:50:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:05:16] !log restart hive metastore and server2 on an-coord1001 to pick up settings for DBTokenStore
[07:05:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:07:01] !log restart hive-server2 on an-coord1002 for consistency
[07:07:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:11:19] tested beeline all good, now trying spark2-shell just to be sure
[07:11:29] yep all good!
[07:12:34] !log re-enable timers after maintenance
[07:12:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:12:59] ok so now let's see if anything strange comes up with the dbtoken store, hopefully none
[07:13:13] this will give us the ability to add a metastore on an-coord1002 anytime
[07:13:30] and to finally unblock the analytics-hive migration
[07:13:35] (brb)
[08:29:03] (CR) WMDE-Fisch: [C: +2] "Lets try this." [analytics/wmde/TW/edit-conflicts] - https://gerrit.wikimedia.org/r/620000 (https://phabricator.wikimedia.org/T246439) (owner: Awight)
[08:40:22] Good morning
[08:40:27] bonjour
[08:40:41] Man, what an efficient morning elukey!
[08:41:23] joal: ahhaha I have restarted a daemon, not really my best technical work ever :D:
[08:42:24] I understand the feeling, but operations on live systems are always to be prepared, then checked, then deployed carefully and tested etc - Thanks for that :)
[08:43:01] http://www.quickmeme.com/meme/3r73wi
[08:43:03] :D
[09:09:02] (CR) Awight: [C: +1] "> I love how these reports are broken down into small pieces. But there might be some small efficiency gain if we collect all the prefere" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/645345 (https://phabricator.wikimedia.org/T260138) (owner: Andrew-WMDE)
[09:10:28] (PS5) Awight: Process EventLogging events and tally preferences for CodeMirror [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/645345 (https://phabricator.wikimedia.org/T260138) (owner: Andrew-WMDE)
[10:09:31] * elukey bbiab, errand!
[10:59:25] (CR) Andrew-WMDE: [C: +1] Process EventLogging events and tally preferences for CodeMirror [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/645345 (https://phabricator.wikimedia.org/T260138) (owner: Andrew-WMDE)
[11:00:04] (CR) Andrew-WMDE: [V: +2] Sample more of 2020 [analytics/wmde/TW/edit-conflicts] - https://gerrit.wikimedia.org/r/620000 (https://phabricator.wikimedia.org/T246439) (owner: Awight)
[11:00:15] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Investigate sporadic failures in oozie hive actions due to Kerberos auth - https://phabricator.wikimedia.org/T241650 (elukey) Today we applied again the change to store hive session tokens on the db (restarting the hive daemons on an-coo...
[11:44:01] * elukey lunch!
[12:01:55] Analytics-Clusters, Operations, ops-eqiad: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (jbond) p:Triage→Medium
[12:58:51] Analytics, Analytics-Wikistats, Product-Analytics: Contribution inequality graphs for Wikistats - https://phabricator.wikimedia.org/T195033 (GoranSMilovanovic) @Jan_Dittrich https://www.rdocumentation.org/packages/REAT/versions/3.0.2/topics/hoover
[13:13:38] hellooo
[13:14:15] Analytics, Analytics-Wikistats, Product-Analytics: Contribution inequality graphs for Wikistats - https://phabricator.wikimedia.org/T195033 (Jan_Dittrich) Thanks – that’s great! I created the [calculation in JavaScript](https://github.com/jdittrich/WikiHoverScore/blob/hooverCalculations/src/js/calcul...
[13:15:58] (PS4) Fdans: Expand EZ project conversion to adapt to raw format [analytics/refinery/source] - https://gerrit.wikimedia.org/r/632597
[13:16:27] (CR) Fdans: "joal: thank you for the quick review!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/632597 (owner: Fdans)
[13:23:15] (CR) Joal: [C: +2] "LGTM! Thanks fdans :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/632597 (owner: Fdans)
[13:30:11] (Merged) jenkins-bot: Expand EZ project conversion to adapt to raw format [analytics/refinery/source] - https://gerrit.wikimedia.org/r/632597 (owner: Fdans)
[13:44:06] Analytics-Clusters, Operations: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (klausman) This is definitely doable, but needs at least one change: The Bullseye version of the package depends on librdkafka1 >= 1.4.2, which Buster...
[13:45:17] ottomata: looks like building kafkacat 1.6.0 is doable on Buster, with minor tweaks
[13:45:46] elukey: Say I have a way of doing the above, what is the correct process to get it available on the stat machines?
[13:46:34] klausman: hi! So Faidon is the Debian upstream maintainer of the package, I was kinda hoping that he could have added the package to debian buster backports with the correct deps
[13:48:04] Good point. I shall do some poking
[13:48:12] :D
[13:49:15] otherwise if not feasible we could have our own version in buster-wikimedia, but in theory it would require a separate repo also to manage it etc.. (or we could do a one-off and avoid a gerrit repo to track the source, but if we then need to change something it will be a mess)
[13:50:27] Aye. I was presuming we could do the latter if all else fails, but likewise, I'd rather avoid it
[13:50:46] super
[13:52:30] Analytics: Kerberos credential cache expiry time on notebook is different than the OS one - https://phabricator.wikimedia.org/T247084 (elukey)
[13:52:34] Analytics-Clusters: Kerberos credential cache location - https://phabricator.wikimedia.org/T255262 (elukey)
[13:53:11] klausman: at some point I'd like to pick your brain/thoughts on --^
[13:53:23] that requires a little introduction to the mess before
[13:54:15] but it is connected also to another thing that I am trying, namely explore ways to periodically/automatically renew kerberos credentials for users with a valid ticket
[13:54:39] (I can also explain this mess as well, and how everything fits together in something magnificent)
[13:55:21] is KRB5CCNAME user-specific or a basename?
[13:56:57] At any rate, it might be possible to move its setting into pam_env.conf, which I *think* is evaluated even if there is no shell (unlike /etc/profile)
[14:00:27] Or /etc/environment, for that matter
[14:03:03] Analytics-Clusters: Kerberos credential cache location - https://phabricator.wikimedia.org/T255262 (klausman) I presume the `KRB5CCNAME` in `/etc/profile` approach doesn't work due to there being no shell involved. Do you think using `/etc/environment` instead might work better? Or are there further problems...
[14:07:06] Analytics-Clusters, Operations: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (klausman) I've also poked Faidon on whether an official backport might be done.
[14:11:54] klausman: I am in meeting up to 16:00, we can chat after that if you have a min?
[14:12:01] so I can give you the background etc..
[14:12:21] I am sure you'll have super good ideas that I completely ignore :D
[14:16:12] sure
[14:57:35] klausman: hw arrived! (I saw the ml-serve hosts)
[14:57:43] yes
[14:58:07] and now to clear up the misunderstandings in N-way communications between me, Chris, DC Ops and a bunch of others :D
[14:58:31] klausman: yep :D
[15:02:57] klausman: come onnnn
[15:02:57] :D
[15:24:23] hey hey elukey ! are there any new steps to getting eventlogging schemas working?
[15:24:50] We made https://meta.wikimedia.org/wiki/Schema:WikibasePingback and we are pinging the beacon expecting things to appear, but perhaps there are other steps?
[15:25:00] reading https://www.mediawiki.org/wiki/Extension:EventLogging/Guide i dont easily spot any
[15:43:10] Does https://gerrit.wikimedia.org/g/analytics/refinery/+/ce1d5fd2f05d54560875276e177741b7b77bf0f4/static_data/eventlogging/whitelist.yaml come into it?
[15:46:55] aaah, maybe the talk page was needed
[15:50:15] Analytics-Clusters, Operations, ops-eqiad: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (Cmjohnson) Still working with Dell on this, tried reseating the raid controller and the cables, the raid card is still not recognized by the bios.
[15:51:30] addshore: sorry I was in a meeting! What source of el data are you checking out? hive?
[15:53:13] mmm so the topic for the schema is there, but with no events
[15:53:14] kafkacat -t eventlogging_WikibasePingback -b kafka-jumbo1001.eqiad.wmnet -C -o beginning
[15:53:19] (this from stat1004)
[15:54:00] addshore: ah no I see an event!
[15:54:30] so now it will take a bit of time for our automation to pick up hourly data, refine it, etc..
[15:54:38] but it should get into hive today for sure
[15:55:39] awesome! yes I see them too now!
[15:55:47] \o/ Thank you addshore and elukey
[15:55:52] I think the talkpage is needed before things get picked up correctly which we missed in the docs
[16:00:47] I now choose that "El Data" is a mythical mexican folk hero that fights injustices with evidence-based analysis.
[16:03:05] * joal wish klausman also write super-hero books
[16:04:35] Ugh. I hate most super-hero content. :)
[16:05:29] Arf - ok :)
[16:07:33] The main issue I have with superhero stuff is that it never really looks at the societal consequences of super heroes being around (there are exceptions, like Watchmen, which is also a deconstruction of the genre)
[16:17:11] Analytics, Product-Analytics, Inuka-Team (Kanban): Set up preview counting for KaiOS app - https://phabricator.wikimedia.org/T244548 (SBisson) The PR is merged
[16:22:02] (CR) Mforns: Sanitize and keep TemplateDataEditor events (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/646670 (https://phabricator.wikimedia.org/T260343) (owner: Awight)
[16:32:59] Analytics, Event-Platform, Product-Analytics, Product-Infrastructure-Data: MEP: Should stream configurations be written in YAML? - https://phabricator.wikimedia.org/T269774 (jlinehan)
[16:47:30] (PS1) Lucas Werkmeister (WMDE): Update MediaWiki CodeSniffer to version 34.0.0 [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/647282
[16:47:32] (PS1) Lucas Werkmeister (WMDE): Extract WikimediaSparql helper class [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/647283
[16:47:38] (PS1) Lucas Werkmeister (WMDE): Add script to collect lexicographical data statistics [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/647284
[16:48:16] (CR) jerkins-bot: [V: -1] Extract WikimediaSparql helper class [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/647283 (owner: Lucas Werkmeister (WMDE))
[16:52:38] (CR) DannyS712: [C: +1] Update MediaWiki CodeSniffer to version 34.0.0 [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/647282 (owner: Lucas Werkmeister (WMDE))
[17:02:50] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog, Epic: Vertical: Virtualpageview datastream on MEP - https://phabricator.wikimedia.org/T238138 (kzimmerman) @sdkim @Ottomata @jlinehan When we migrate this, we need to make sure we include geocoded data. It's...
[17:03:11] yo a-team standup or what
[17:14:06] Analytics, Analytics-Kanban, Privacy Engineering, Product-Analytics, and 3 others: Drop data from Prefupdate schema that is older than 90 days - https://phabricator.wikimedia.org/T250049 (nettrom_WMF) I ran the query below and confirmed that the re-sanitization had filled in the missing sanitized...
[17:15:47] Analytics-Clusters, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (Cmjohnson) @elukey Here is what is in racks now (not setup) 2 servers in A2 2 servers in A4 I requested @dzahn to move 3 mw servers to make room...
[17:19:10] (CR) Lucas Werkmeister (WMDE): "recheck, looks like a random git clone error" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/647283 (owner: Lucas Werkmeister (WMDE))
[17:57:55] (CR) Joal: [C: -1] "The list of WMF domains should be sourced from a reference place. The rest looks great :)" (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/646808 (https://phabricator.wikimedia.org/T256674) (owner: Ottomata)
[18:01:47] (CR) Joal: [C: +1] "Works for me when parent patch is ready - Only thing I am not able to validate is the columns chosen to check domain for - I trust you on " [analytics/refinery/source] - https://gerrit.wikimedia.org/r/646828 (https://phabricator.wikimedia.org/T256677) (owner: Ottomata)
[18:10:25] Analytics, Product-Analytics, Inuka-Team (Kanban): Set up preview counting for KaiOS app - https://phabricator.wikimedia.org/T244548 (Jpita) can someone check that the metrics are being sent? also, I can't move this to any other column .
[18:35:49] * elukey afk!
[18:35:59] joal: I'll add the extra metastore tomorrow morning :)
[18:40:58] PROBLEM - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error executing row event: Table superset_staging.ab_user doesnt exist https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:48:07] Analytics, Product-Analytics, Inuka-Team (Kanban): Set up preview counting for KaiOS app - https://phabricator.wikimedia.org/T244548 (SBisson) >>! In T244548#6680052, @Jpita wrote: > can someone check that the metrics are being sent? > also, I can't move this to any other column . @nshahquinn-wmf di...
[18:51:27] arf
[18:51:57] razzi: would the error above be related to your work on superset?
[18:52:41] joal: I think so
[18:53:06] Ok razzi thanks for confirming - no big deal, I was just wondering if there was something to care
[18:54:34] yup, thank you joal - experimenting with the staging superset instance and seeing some errors on my end as well
[18:55:02] Ack!
[19:26:43] PROBLEM - Presto Server on an-presto1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[19:32:55] RECOVERY - Presto Server on an-presto1001 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down
[19:34:41] addshore: hey I just noticed you have https://meta.wikimedia.org/wiki/Schema:WikibasePingback producing some events. It's failing to refine I guess because the "extensions" property needs to specify what type of items it contains? Lemme know if this rings a bell, I can dig deeper and figure it out, just wanted to know your plans
[19:44:54] Analytics-Clusters, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (Cmjohnson) @dzahn and I were able to move mw1281-1283 and we now have 6 servers total in row A.
[20:04:09] mforns: you around?
[20:04:17] yep!
[20:04:18] need a quick consult on refine
[20:04:22] bc?
[20:04:30] we should be able to do it here
[20:04:42] so this schema is failing to refine: https://meta.wikimedia.org/wiki/Schema:WikibasePingback
[20:04:58] because it has an array and the exception is: "Original exception: java.lang.IllegalArgumentException: `extensions` array schema did not specify the items field"
[20:05:32] this other schema is active and has an array field, with the items properly specified: https://meta.wikimedia.org/wiki/Schema:PageIssues
[20:05:43] but that's excluded from refine: https://github.com/wikimedia/puppet/blob/6ac62bdd448bc8972dc1ca00cf508e5253f6d820/modules/profile/manifests/analytics/refinery/job/refine.pp#L130
[20:05:57] so is it that we can't refine schemas with arrays?
[20:06:02] should I blacklist the new one
[20:07:13] milimetric: reading, sorry was answering on another thread
[20:07:22] np at all
[20:07:29] thank you for taking a look
[20:09:10] milimetric: I think we do refine array fields
[20:09:12] checking
[20:10:01] hm, maybe I should blacklist for now and advise to specify the items property of the extensions property?
[20:12:24] milimetric: refine does indeed refine array fields if correctly specified
[20:12:28] see: https://meta.wikimedia.org/wiki/Schema:CentralNoticeBannerHistory
[20:12:45] this schema is correctly specified and has refined data for the last 3 months
[20:13:00] I believe the issue here is the schema
[20:13:22] milimetric: yes you're right
[20:13:57] we should temporarily blacklist and open a ticket for the schema owners
[20:14:45] ok, done, I cc-ed Adam, I'll open a task if he doesn't respond soon
[20:17:39] hey razzi, just added you to a puppet review, simple blacklisting of a schema that's not refining (details above if you're curious). Let me know if you want to dig deeper into any of that
[20:19:27] milimetric: cool, reading up
[20:25:05] milimetric: lgtm
[20:25:24] razzi: anything ongoing with an-coord1002?
[20:25:49] elukey: don't think so, what do you mean?
[20:26:12] razzi: there is an alert with
[20:26:12] PROBLEM alert - an-coord1002/MariaDB Replica SQL: analytics-meta-replica is CRITICAL
[20:26:25] so the replica is broken due to a table changed
[20:26:37] ahh wait "Table superset_staging.ab_user doesnt exist"
[20:27:37] elukey: Yeah, that happened when I accessed the staging superset; I'll acknowledge the alarm
[20:28:05] razzi: well let's fix the problem no :) ?
[20:28:16] :)
[20:28:43] so on an-coord1002 there is a mysql instance that replicates from an-coord1001
[20:29:10] and if you check on 1002, (sudo mysql; show databases) the database that it is erroring out it is not listed
[20:29:17] since we don't care about replicating it
[20:29:48] so when you accessed superset staging (that uses a db on an-coord1001) it added some things to a table, that didn't replicate on 1002
[20:30:33] on db1108, the other db that replicates from an-coord1001, we blacklisted superset_staging
[20:30:45] I think that we need to apply the same fix on 1002
[20:30:47] lemme check
[20:31:51] yes we have on db1108
[20:31:51] # Do not replicate the following databases
[20:31:51] replicate_wild_ignore_table=superset\_staging.%
[20:34:07] !log execute on mysql:an-coord1002 "set GLOBAL replicate_wild_ignore_table='superset_staging.%'" to avoid replication for superset_staging from an-coord1002
[20:34:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:34:33] show slave status \G; looks good (all running)
[20:34:38] so it should recover soon
[20:34:55] RECOVERY - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:35:14] there we go! elukey: could that have been done with a config file?
[20:36:04] razzi: yep I am filing a code review, but it would require a mysql restart, so I applied the global override for the moment
[20:39:11] razzi: this is the permanent change https://gerrit.wikimedia.org/r/c/operations/puppet/+/647358
[20:39:27] I need to verify with Data Persistence if it is the right thing to do but for the moment we are ok :)
[20:45:00] razzi: there is also an interesting alarm that fired from an-presto1001
[20:45:28] elukey: yeah, I see that; thinking of running puppet manually to inspect the output
[20:45:31] or I could view logs
[20:46:10] razzi: so running "sudo dmesg -T" on it shows the OOM killer acting
[20:46:18] [Wed Dec 9 19:23:51 2020] Killed process 44512 (java) total-vm:147090276kB, anon-rss:123181256kB, file-rss:0kB, shmem-rss:0kB
[20:46:19] yeah, I see also `# There is insufficient memory for the Java Runtime Environment to continue.`
[20:46:59] https://grafana.wikimedia.org/d/pMd25ruZz/presto?orgId=1
[20:47:19] not a lot lining up with 19:23 though, weird
[20:48:03] if I have to guess somebody is making a huge query in presto
[20:49:36] but sadly I think we don't have a query log for presto
[20:51:49] http://dharmeshkakadia.github.io/presto-event-listener/
[20:51:59] so we'd need to write something custom sigh
[20:52:01] joal: --^
[20:52:23] razzi: so we have two issues
[20:52:43] 1) we don't have a log for presto stating "I am executing big query xyz from user batman etc.."
[20:52:53] so it is difficult to track down problems
[20:53:18] mmhm
[20:53:21] 2) in theory even a big query shouldn't cause the OOM killer to kill the java process, so we might need to adjust heap settings
[20:55:50] razzi: I think that we have only server logs, not query ones..
[20:56:43] another guide is https://aws.amazon.com/blogs/big-data/custom-log-presto-query-events-on-amazon-emr-for-auditing-and-performance-insights/
[21:01:14] razzi: answered to the alert emails! Tomorrow there are some follow ups to do :(
[21:01:24] going afk again ttl! o/
[21:01:30] alright! cya elukey
[21:50:15] Analytics-Clusters, Product-Analytics: Configure superset cache - https://phabricator.wikimedia.org/T268784 (razzi) Current progress: configured staging superset to use memcache, but pylibmc was installed as an apt package and the process uses a virtual environment, so pylibmc needs to be installed there...
[21:55:40] whoever (Luca?) added the information about when kinit tickets expire when logging onto stat machines. THANK YOU!
[22:06:44] Analytics, Analytics-Kanban, Image-Recommendations, Patch-For-Review: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (CBogen)
[22:20:17] milimetric: thanks for the ping, i forwarded the message to the team working on it :)
[22:21:32] milimetric: that isnt something is defined in the schema page? =o
[22:21:46] I can say that the elements of the "extensions" array are strings :)
[22:21:47] thx addshore! Feel free to cc them on the email I sent
[22:21:54] *checks inbox*
[22:22:08] I know, yeah, but figure the schema version has to be updated etc.
[22:22:22] <3 for the email, awesome!
[22:22:36] They'll get to it tomorrow :)
[23:28:46] (PS1) Razzi: Install pylibmc and update wheels for superset [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/647387
[23:54:05] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Team-Backlog, Epic: Vertical: Virtualpageview datastream on MEP - https://phabricator.wikimedia.org/T238138 (sdkim) @kzimmerman acknowledged. Last speaking to @Ottomata, this would require it's own migration plan and has be...