[00:23:07] (03PS1) 10Lex Nasser: Add double quote when constructing JSON in Hive query and change field names in properties file for top-per-country job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/668236 (https://phabricator.wikimedia.org/T207171) [00:27:44] 10Analytics-Radar, 10Better Use Of Data, 10Instrument-ClientError, 10Wikimedia-Logstash, and 2 others: Documentation of client side error logging capabilities on mediawiki - https://phabricator.wikimedia.org/T248884 (10Jdlrobson) I documented alerts here: https://wikitech.wikimedia.org/wiki/Client_errors -... [00:44:45] (03PS4) 10Lex Nasser: Create pageviews 'top-per-country' endpoint with tests [analytics/aqs] - 10https://gerrit.wikimedia.org/r/657228 (https://phabricator.wikimedia.org/T207171) [00:47:27] (03CR) 10Lex Nasser: "Just tested this change with the AQS test cluster, and the endpoint seems to behave as expected. This change depends on a slight naming al" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/657228 (https://phabricator.wikimedia.org/T207171) (owner: 10Lex Nasser) [02:35:12] (03PS1) 10Sharvaniharan: Initial commit for migrating image recommendations table. [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/668243 [02:35:20] (03PS10) 10Milimetric: Add daily referrers Hive table and Oozie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/655804 (https://phabricator.wikimedia.org/T270140) (owner: 10Bmansurov) [02:35:44] (03CR) 10jerkins-bot: [V: 04-1] Initial commit for migrating image recommendations table. 
[schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/668243 (owner: 10Sharvaniharan) [02:43:24] (03PS11) 10Milimetric: Add daily referrers Hive table and Oozie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/655804 (https://phabricator.wikimedia.org/T270140) (owner: 10Bmansurov) [02:44:45] (03PS1) 10Sharvaniharan: Rename file [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/668244 [03:26:09] (03PS12) 10Milimetric: Add daily referrers Hive table and Oozie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/655804 (https://phabricator.wikimedia.org/T270140) (owner: 10Bmansurov) [03:27:58] (03CR) 10Milimetric: "Ok, Isaac, the job seems ok now, the data is ready for review:" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/655804 (https://phabricator.wikimedia.org/T270140) (owner: 10Bmansurov) [05:21:30] 10Analytics-Radar, 10Cassandra, 10ContentTranslation, 10Event-Platform, and 9 others: Rebuild all blubber build docker images running on kubernetes - https://phabricator.wikimedia.org/T274262 (10KartikMistry) [07:38:05] !log reboot an-worker1096 to pick up 5.10 kernel [07:38:09] good morning :) [07:38:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:00:33] interesting, on an-worker1096 (buster + 5.10) I am not part of the 'render' group [08:02:16] ah right we need to deploy gpu-users [08:29:15] (03CR) 10Elukey: [C: 03+2] Add backticks to reserved word date in geoeditors monthly job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/668111 (https://phabricator.wikimedia.org/T274322) (owner: 10Mforns) [08:34:20] !log deploy refinery to fix https://gerrit.wikimedia.org/r/c/analytics/refinery/+/668111 [08:34:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:34:36] I checked with git fetch + diff that Marcel's change was the only one going out [08:35:00] super elukey - thanks for that [08:35:12] bonjour joal :) [08:35:16] hi eu [08:35:22] hi 
elukey sorry :) [08:35:51] I opened https://issues.apache.org/jira/browse/BIGTOP-3515 for hive, it was the weird issue for https://phabricator.wikimedia.org/T276121 [08:36:12] if they are ok to merge I'll just rebuild packages and deploy [08:36:40] (Hive 3.x contains the fix, not the 2.x branch, but the patch is really trivial) [08:37:18] ack elukey [08:37:33] elukey: I wonder if we could suggest bigtop to embed gobblin [08:41:07] it could be an option yes [08:48:44] !log deploy refinery to hdfs [08:48:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:02:55] !log kill/start mediawiki-geoeditors-monthly to apply backtick change (hive script) [09:02:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:32:52] !log reboot an-worker[1097-1101] (GPU workers) to pick up the new kernel (5.10) [09:32:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:32:56] will do one at a time :P [09:33:24] ack elukey [09:33:29] elukey: I'll check failed jobs [09:33:52] joal: don't worry it is my ops week, I break and fix :D [09:34:10] elukey: yeah I've seen that you fix fast indeed :) Thanks for that [09:34:56] joal: nono it is to let you concentrate on your tasks, sometimes you deserve some time without ops :) [09:35:08] elukey: I have a spark job relying on an-worker1121 and an-worker1138 currently - if you can leave them alone for some time I'd be grateful :) [09:35:19] joal: sure! 
[09:35:48] elukey: already 9.2Tb generated - /me hope that it will not fail [09:36:50] * elukey fingers crossed [09:37:06] after this round of reboots we'll be ready to use yarn labels [09:37:25] \o/ [09:37:50] * joal is happy but doesn't actually know how to test GPUs :) [09:38:10] elukey: I'll be able to test the labels, but not the GPUs per se [09:38:34] joal: I think that Miriam and Fabian will try to test criteo's tf-on-yarn [09:38:48] great [09:43:13] Analytics-Clusters: Configure Yarn to be able to locate nodes with a GPU - https://phabricator.wikimedia.org/T264401 (elukey) Stalled→Open Aaand we finally have hadoop 2.10.1, so labels are well supported. Let's try to deploy them :) [09:44:30] Analytics-Radar, Discovery-Search, Reading-Admin, Research, and 2 others: Image Classification Working Group - https://phabricator.wikimedia.org/T215413 (Miriam) [09:50:27] Analytics-Radar, Discovery-Search, Reading-Admin, Research, and 2 others: Image Classification Working Group - https://phabricator.wikimedia.org/T215413 (Miriam) [10:04:39] (CR) Joal: "Another round of discussion :)" (5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/664885 (https://phabricator.wikimedia.org/T273116) (owner: Mforns) [10:04:45] (CR) Thiemo Kreuz (WMDE): [C: +1] Fix wiring to metrics (2 comments) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/668039 (https://phabricator.wikimedia.org/T271902) (owner: Awight) [10:07:05] also elukey: From our comments on task about thorium: let's delete the data on stat1006 :) [10:07:06] Analytics, FR-Tech-Analytics, Fundraising-Backlog: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage - https://phabricator.wikimedia.org/T273246 (Jdrewniak) @mpopov that's right, the client-side instrumentation will have to be updated for the Event Platform. @EYener Th... 
[10:24:38] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10elukey) ` elukey@cumin1001:~$ sudo cumin 'A:hadoop-worker' 'cat /etc/debian_version' 78 hosts will be targeted: an-worker[1078-1128,1130-1132,1135-1138].eqiad.wmnet,analytics[1058-1077].eqia... [10:34:41] joal: sure! [10:34:56] joal: if ok I'd reimage analytics1059 and 1060 [10:35:05] please go elukey [10:35:15] "only" 52 hosts to go [10:35:32] but it looks good, the procedure is streamlined and it seems low impact [10:35:50] we should be done in hopefully 2/3 weeks [10:36:33] (03CR) 10Jdrewniak: WikipediaPortal schema whitelist request (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/666223 (owner: 10Erin Yener) [10:40:23] * joal bows to elukey's patience and persistence [10:40:27] !log drain + reimage analytics1059/1060 to Debian Buster [10:40:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:41:20] joal: if by then end of March we have the workers on buster and druid not exploding every month I'll be super happy :D [10:41:26] *the end [10:48:06] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['analytics1059.eqiad.wmnet', 'analytics1060.eqiad.wmnet'] ` The log can be found in... [11:01:42] 10Analytics-Radar, 10WMDE-Templates-FocusArea, 10MW-1.36-notes (1.36.0-wmf.30; 2021-02-09), 10Patch-For-Review, and 3 others: Compensate for sampling - https://phabricator.wikimedia.org/T273454 (10awight) [11:03:44] 10Analytics-Radar, 10WMDE-Templates-FocusArea, 10MW-1.36-notes (1.36.0-wmf.30; 2021-02-09), 10Patch-For-Review, and 3 others: Compensate for sampling - https://phabricator.wikimedia.org/T273454 (10awight) 05Open→03Declined Let's not bother. 
The discontinuity only affects VisualEditor template dialog m... [11:19:04] hello team :] [11:19:19] hi mforns [11:19:27] thanks for deploying yesterday's fix elukey [11:19:31] hey joal [11:22:53] mforns: <3 [11:24:50] joal: looking at your review, one q: the first reduce is going to apply after data is filtered, right? Even if the source data is 2GB per day, the source data for the reduce would be much smaller, because we're just using 2 columns, no? [11:25:31] but, nevertheless, ordering by seems a better solution, will do! [11:25:41] mforns: data would be smaller because it is filtered yes, but the job of the window-function is better parallelized IMO [11:25:58] makes sense [11:26:23] also joal: your proposal of using host_properties is exactly the way I did the query first. [11:26:32] but when we moved to BigTop that ceased to work [11:26:46] that's why I changed to using a regex_replace [11:27:07] mforns: maybe you didn't use the upgraded version of the packages? [11:28:29] joal: maybe, that should work again then? [11:28:45] mforns: I tested today with the example I gave [11:28:52] ok, great, will change back to that [11:29:07] thanks for the reviewwww! [11:29:10] mforns: using 'hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar' [11:29:22] ok [11:29:32] so 0.1.2 seems to work :) [11:30:01] thank you mforns for accepting my comments :) [11:33:24] joal: does it make sense to ORDER BY NULL, in this case? It would enforce 1 reducer, but be more performant. I think we don't need the ordering... or do we? Maybe Presto is going to like having the data ordered when calculating percentiles?? [11:34:19] mforns: presto will not know about data being ordered in any case - I don't know if 'order by null' works - can be tried :) [11:34:40] in any case the data should be small enough so that it doesn't matter really :) [11:34:45] mforns: --^ [11:34:55] mforns: if the data is bigger, then we'll have to readjust [11:35:14] joal: it works! 
awight used it in a RU query [11:35:34] ok [11:35:56] I imagine it works in enforcing a single reducer? [11:36:00] mforns: --^ [11:36:07] I will triple-check [11:36:24] mforns: did you see https://phabricator.wikimedia.org/T276121 ? Bigtop merged my patch so I'll probably roll out a new hive version on clients, it is a sneaky bug, not sure if you saw a similar problem with RU [11:36:41] elukey: lookin [11:38:53] elukey: I looked into that problem a bit yesterday before you let me know that you already did, and yea, I saw that beeline would output some empty lines to stdout even with the -silent flag and 2> /dev/null, so I thought that was the problem. I didn't imagine that was the lib's fault... I thought it was us [11:40:10] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1060.eqiad.wmnet', 'analytics1059.eqiad.wmnet'] ` and were **ALL** successful. [11:40:19] mforns: very weird yes, with the patch (live applied on stat1007) the issue went away, if it happens with RU keep it in mind (not sure how many issues you are reviewing with a*wight) [11:40:40] mforns: do you think it is ok to rebuild and roll out the new version? [11:40:56] elukey: yes! [11:41:21] no-one else has reported that problem with reportupdater, but it could be happening in other jobs, not sure [11:41:32] perfect will do then :) [11:41:38] cool, thanks! [11:41:40] going afk for lunch! [11:41:52] :thup: [11:42:03] didn't work :( [11:48:37] joal: Just intuitively, I would be surprised if `order by null` enforced a single reducer, since without ordering the results can be concatenated together in any sequence and no logic is required besides appending rows atomically. 
[11:49:28] awight: I follow your point, but sometimes systems don't behave in ways wehink they should :D [11:49:42] *we think [11:53:58] taking a break :) [12:04:17] joal, awight: I checked the query with ORDER BY NULL, and it does enforce 1 reducer, thus the output is all in 1 file. [12:05:37] now, given that this is not a behavior defined by hive, it could maybe change in the future without us noticing, for example, if they decide to optimize order by clauses that are no-ops.. [12:06:30] so.... I think it's still better to e.g. ORDER BY wiki, even if we don't use it. thoughts? [12:09:43] mforns: Now that you mention it... I don't have any evidence that `order by null` helped in my case. All I know is that the `explain` showed that the filesort went away, and it was a large amount of data so I figured it was better to avoid the extra writes. [12:10:00] Here, maybe you can just give a hint to force 1 reducer, if that's a behavior you want? [12:10:42] & maybe looking at the overall disk + cpu for the query is helpful for finding the optimization? [12:14:07] awight: thanks! In our case, we just want to enforce 1 reducer, but not for the whole query, just the last step, so that the output data is concentrated in 1 file (to avoid overloading the metastore with too many small files) [12:14:50] the output data is really small, and the performance hit is practically zero [12:16:03] I checked, and there's no evident performance difference between ORDER BY NULL and ORDER BY wiki, in our case (with this data). 
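(Editor's sketch, not part of the channel traffic.) The reasoning above, that a global ORDER BY funnels the last step through one reducer and so yields one output file, can be illustrated with a toy Python model. This is not Hive: it simply assumes that each reducer writes exactly one file and that a total order needs all rows in one reducer. The function name `write_output`, the reducer count, and the sample rows are hypothetical.

```python
from collections import defaultdict

def write_output(rows, num_reducers, global_order_key=None):
    """Toy model: return {file_name: rows}, one file per non-empty reducer."""
    buckets = defaultdict(list)
    if global_order_key is not None:
        # A total order needs every row in one place: a single reducer.
        buckets[0] = sorted(rows, key=global_order_key)
    else:
        # Without ordering, rows are spread round-robin over the reducers.
        for i, row in enumerate(rows):
            buckets[i % num_reducers].append(row)
    return {f"part-{r:05d}": out for r, out in sorted(buckets.items())}

# Hypothetical (wiki, value) rows standing in for the small final result set.
rows = [("enwiki", 12), ("dewiki", 7), ("frwiki", 9), ("eswiki", 4)]
unordered = write_output(rows, num_reducers=3)
ordered = write_output(rows, num_reducers=3, global_order_key=lambda r: r[0])
print(len(unordered), "files without a global order")
print(len(ordered), "file with a global order on wiki")
```

The model only covers the explicit-key case (ORDER BY wiki). Whether ORDER BY NULL keeps the same behaviour is exactly the part described above as not defined by Hive, so the sketch makes no claim about it.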
[12:16:12] I'll go with ORDER BY wiki [12:48:22] !log drain + reimage analytics10[61,62] to Debian Buster [12:48:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:54:36] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['analytics1061.eqiad.wmnet', 'analytics1062.eqiad.wmnet'] ` The log can be found in... [13:14:29] (03PS1) 10Sahilgrewalhere: Fixed typo "paramaters" [analytics/aggregator] - 10https://gerrit.wikimedia.org/r/668412 (https://phabricator.wikimedia.org/T201491) [13:26:30] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1061.eqiad.wmnet', 'analytics1062.eqiad.wmnet'] ` and were **ALL** successful. [13:32:35] !log drain + reimage analytics10[63,64] to Debian Buster [13:32:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:39:40] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['analytics1063.eqiad.wmnet', 'analytics1064.eqiad.wmnet'] ` The log can be found in... [13:41:21] (03PS8) 10Mforns: Add oozie job for session length computation [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664885 (https://phabricator.wikimedia.org/T273116) [13:45:04] (03CR) 10Mforns: Add oozie job for session length computation (034 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664885 (https://phabricator.wikimedia.org/T273116) (owner: 10Mforns) [14:02:24] * elukey bbiab! 
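(Editor's sketch, not part of the channel traffic.) The !log entries above use a bracket shorthand for host lists, for example analytics10[61,62]. A minimal Python sketch of how such an expression could be expanded follows. It is an illustration only and is not the parser that cumin or any other tool actually uses; the function name `expand_hosts` is hypothetical, and the sketch assumes at most one bracket group per host pattern.

```python
import re

def expand_hosts(expr):
    """Expand e.g. 'analytics10[61,62]' into ['analytics1061', 'analytics1062']."""
    hosts = []
    # Split on commas that sit outside square brackets.
    for part in re.split(r",(?![^\[]*\])", expr):
        m = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\]([^\[\]]*)", part)
        if not m:
            hosts.append(part)  # plain hostname, nothing to expand
            continue
        prefix, body, suffix = m.groups()
        for item in body.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo  # a single number is a range of one
            width = len(lo)  # keep any zero padding
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts

print(expand_hosts("analytics10[61,62]"))
# Host expression from the cumin output quoted in this log, domain suffix dropped.
workers = expand_hosts("an-worker[1078-1128,1130-1132,1135-1138],analytics[1058-1077]")
print(len(workers))
```

Counting the expanded list is a quick way to cross-check the "78 hosts will be targeted" figure that the cumin output in this log reports.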
[14:06:41] joal, ottomata (and maybe others): first sonar analysis of refinery: https://sonarcloud.io/dashboard?id=org.wikimedia.analytics.refinery%3Arefinery [14:07:05] thanks gehel - will look after meeting :) [14:07:11] hm cool [14:07:11] There is still a global issue on analyzing CRs, but I hope to fix that soon-ish [14:07:39] sonar is running on Java 11, that might bring some issues, but ping me if there is anything strange happening [14:11:38] code coverage isn't reported, I suspect because it isn't configured in the pom. I can dig into it if you want (but no promise on when I'll have time) [14:12:18] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1063.eqiad.wmnet', 'analytics1064.eqiad.wmnet'] ` and were **ALL** successful. [14:21:35] !log drain + reimage analytics1065 to Debian Buster [14:21:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:26:50] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['analytics1065.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20... [14:47:37] hey a-team, I won't be able to attend standup/grosking today, I'm visiting a new school for my daughter, will send an e-scrum later. If you want to talk goals, please go ahead, I'll try to be there end of grosking [14:47:46] (03PS1) 10Majavah: Update to Buster, dh, debsrc 3.0 [analytics/udplog] - 10https://gerrit.wikimedia.org/r/668451 [14:50:05] (03CR) 10Majavah: Update to Buster, dh, debsrc 3.0 (031 comment) [analytics/udplog] - 10https://gerrit.wikimedia.org/r/668451 (owner: 10Majavah) [14:51:18] mforns: sounds good! [14:56:38] i guess we have a staff meeting today anyway? 
[14:57:42] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1065.eqiad.wmnet'] ` and were **ALL** successful. [15:12:02] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['analytics1066.eqiad.wmnet', 'analytics1067.eqiad.wmnet'] ` The log can be found in... [15:12:04] !log drain + reimage analytics106[6,7] to Debian Buster [15:12:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:12:33] (03CR) 10DannyS712: [C: 03+1] Fixed typo "paramaters" [analytics/aggregator] - 10https://gerrit.wikimedia.org/r/668412 (https://phabricator.wikimedia.org/T201491) (owner: 10Sahilgrewalhere) [15:12:48] (03CR) 10Joal: "Probably last round :)" (034 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664885 (https://phabricator.wikimedia.org/T273116) (owner: 10Mforns) [15:13:06] (03CR) 10jerkins-bot: [V: 04-1] Fixed typo "paramaters" [analytics/aggregator] - 10https://gerrit.wikimedia.org/r/668412 (https://phabricator.wikimedia.org/T201491) (owner: 10Sahilgrewalhere) [15:28:09] * elukey coffee [15:34:08] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10JAllemandou) @lexnasser heya - I have found the time to make a deeper analysis of the errors you encountered on your HQL query. I have a query-version that is clo... 
[15:41:36] (CR) Isaac Johnson: [C: +1] "Some comments about format/descriptions but the code (and the temporary results) look good and make sense to me (seeing the same patterns " (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/655804 (https://phabricator.wikimedia.org/T270140) (owner: Bmansurov) [15:44:43] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1066.eqiad.wmnet', 'analytics1067.eqiad.wmnet'] ` and were **ALL** successful. [15:45:18] Analytics, Patch-For-Review: Newpytyer python spark kernels - https://phabricator.wikimedia.org/T272313 (Ottomata) @fkaelin https://gerrit.wikimedia.org/r/c/operations/puppet/+/668466 should fix your original bug around requests and CA certificates. Nice find, thank you! [15:47:49] fdans, sukhe: I'm sorry, but I won't make it to our traffic anomalies meeting today... I have an appointment that is necessarily at this hour. We can mention any updates async via IRC later? [15:51:53] mforns_brb: yes please, don't worry about it! thanks! [16:01:02] ryankemper: retro? https://meet.google.com/ssh-zegc-cyw [16:01:31] gehel: wrong channel but yup, be right there [16:07:04] (CR) Sahilgrewalhere: "Hi @DannyS712," [analytics/aggregator] - https://gerrit.wikimedia.org/r/668412 (https://phabricator.wikimedia.org/T201491) (owner: Sahilgrewalhere) [16:12:05] Analytics: Odd behavior in unique device counts - https://phabricator.wikimedia.org/T276472 (Isaac) [16:20:47] TFW when you spend an hour trying to debug why you're getting too few (read: 1) results from your Spark query, and then you realize it says "select count(*)..." 
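(Editor's sketch, not part of the channel traffic.) The "too few (read: 1) results" surprise above comes from the fact that an aggregate such as count(*) without GROUP BY returns a single row. A minimal Python sketch, using the standard-library sqlite3 module instead of Spark and a hypothetical `events` table, shows the effect:

```python
import sqlite3

# In-memory database with a hypothetical three-row table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (wiki TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("enwiki",), ("dewiki",), ("frwiki",)],
)

# The aggregate collapses the whole table into one result row.
count_rows = conn.execute("SELECT count(*) FROM events").fetchall()
# Selecting the column returns one result row per table row.
all_rows = conn.execute("SELECT wiki FROM events").fetchall()

print(len(count_rows), "result row, holding the count", count_rows[0][0])
print(len(all_rows), "result rows for the plain select")
conn.close()
```

The same single-row shape is what standard SQL prescribes for an ungrouped aggregate, so it should carry over to a Spark SQL query, though only SQLite is exercised here.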
[16:26:42] klausman: :)( [16:26:50] err only :) [16:27:06] !log drain + reimage analytics106[8,9] to Debian Buster (one is a journalnode) [16:27:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:29:24] 10Analytics-EventLogging, 10Analytics-Radar, 10Better Use Of Data, 10Event-Platform, and 4 others: OperationError: The operation failed for an operation-specific reason in generateRandomSessionId - https://phabricator.wikimedia.org/T263041 (10Jdlrobson) 05Open→03Resolved I can confirm the fix with the... [16:32:19] (03CR) 10Fabian Kaelin: Add daily referrers Hive table and Oozie job (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/655804 (https://phabricator.wikimedia.org/T270140) (owner: 10Bmansurov) [16:35:16] 10Analytics, 10Product-Infrastructure-Team-Backlog, 10Wikimedia Taiwan, 10Chinese-Sites, 10Pageviews-Anomaly: Top read is showing one page that had fake traffic in zhwiki - https://phabricator.wikimedia.org/T274605 (10Htchien) The Wikimedia Taiwan group on Phab is inactive because we don't have many peop... [16:36:17] (03CR) 10Fabian Kaelin: Add daily referrers Hive table and Oozie job (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/655804 (https://phabricator.wikimedia.org/T270140) (owner: 10Bmansurov) [16:41:14] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['analytics1068.eqiad.wmnet', 'analytics1069.eqiad.wmnet'] ` The log can be found in... 
[16:49:22] PROBLEM - HDFS missing blocks on an-master1001 is CRITICAL: 7 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_missing_blocks https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=40&fullscreen [16:52:20] PROBLEM - HDFS corrupt blocks on an-master1001 is CRITICAL: 2143 ge 50 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_corrupt_blocks https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen [16:54:39] wow [16:54:42] elukey: any idea? [16:57:24] joal: I am reimaging two nodes, a little strange, let's wait to see [16:58:10] ack [16:58:40] (I am in a meeting but will keep an eye) [17:02:21] yo a-team standup yall [17:04:46] (03CR) 10Isaac Johnson: [C: 03+1] "responding to Fabian's questions" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/655804 (https://phabricator.wikimedia.org/T270140) (owner: 10Bmansurov) [17:12:49] RECOVERY - HDFS missing blocks on an-master1001 is OK: (C)5 ge (W)2 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_missing_blocks https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=40&fullscreen [17:13:00] \o/ [17:13:56] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1068.eqiad.wmnet', 'analytics1069.eqiad.wmnet'] ` and were **ALL** successful. [17:14:15] joal: I'll be more careful with racking details, sorryyy [17:15:28] 10Analytics-Radar, 10Growth-Team (Current Sprint), 10Product-Analytics (Kanban): Growth: remove Homepage and Help Panel schemas from the schema whitelist - https://phabricator.wikimedia.org/T273826 (10nettrom_WMF) 05Open→03Resolved This work is completed! 
[17:15:33] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: End wider data purge window - https://phabricator.wikimedia.org/T273815 (10nettrom_WMF) [17:16:53] RECOVERY - HDFS corrupt blocks on an-master1001 is OK: (C)50 ge (W)30 ge 3 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_corrupt_blocks https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen [17:24:13] 10Analytics-Clusters, 10SRE: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10razzi) [17:24:59] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10elukey) ` elukey@cumin1001:~$ sudo cumin 'A:hadoop-worker' 'cat /etc/debian_version' 78 hosts will be targeted: an-worker[1078-1128,1130-1132,1135-1138].eqiad.wmnet,analytics[1058-1077].eqia... [17:25:10] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: End wider data purge window - https://phabricator.wikimedia.org/T273815 (10nettrom_WMF) [17:25:25] 10Analytics, 10Analytics-Kanban, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: remove deletion timers for Growth's sanitized EL tables - https://phabricator.wikimedia.org/T274297 (10nettrom_WMF) 05Open→03Resolved As far as I can tell, this looks good to close as well, cheers! [17:26:09] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: End wider data purge window - https://phabricator.wikimedia.org/T273815 (10nettrom_WMF) [17:26:12] 10Analytics, 10Analytics-Kanban, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: delete data older than 90 days - https://phabricator.wikimedia.org/T273821 (10nettrom_WMF) 05Open→03Resolved The tables have been deleted and we've verified that this didn't break anything in the Growth tea... 
[17:26:57] 10Analytics-Radar, 10Product-Analytics (Kanban): Big increase in traffic for projects except 'wikipedia' family since Feb 14th - https://phabricator.wikimedia.org/T274823 (10razzi) [17:27:11] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: End wider data purge window - https://phabricator.wikimedia.org/T273815 (10nettrom_WMF) 05Open→03Resolved a:03nettrom_WMF This work has been completed, closing as resolved. [17:27:22] 10Analytics-Radar, 10Machine-Learning-Team, 10SRE: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10razzi) [17:30:34] 10Analytics, 10PM: Fix Analytics workflow for #Analytics-EventLogging tasks - https://phabricator.wikimedia.org/T274490 (10razzi) cc @Ottomata [17:37:03] 10Analytics: Reducing logging levels when running a Hive query - https://phabricator.wikimedia.org/T274914 (10razzi) p:05Triage→03Low [17:39:11] 10Analytics, 10Event-Platform, 10Patch-For-Review: WikimediaEventUtilities and produce_canary_events job should use api-ro.discovery.wmnet instead of meta.wikimedia.,org to get stream config - https://phabricator.wikimedia.org/T274951 (10razzi) a:03Ottomata @Ottomata What are the actions left for this task? [17:39:40] 10Analytics, 10Product-Infrastructure-Team-Backlog, 10Wikimedia Taiwan, 10Chinese-Sites, 10Pageviews-Anomaly: Top read is showing one page that had fake traffic in zhwiki - https://phabricator.wikimedia.org/T274605 (10Htchien) >>! In T274605#6839960, @Shizhao wrote: >>>! 在T274605#6833068中,@Antigng写道: >... 
[17:40:22] 10Analytics, 10WMDE-Templates-FocusArea, 10WMDE-TechWish-Sprint-2021-02-17: Backfill metrics for TemplateWizard and VisualEditor - https://phabricator.wikimedia.org/T274988 (10razzi) a:03Milimetric [17:41:47] 10Analytics-Clusters: Upgrade Matomo to latest upstream - https://phabricator.wikimedia.org/T275144 (10razzi) p:05Triage→03Medium a:03razzi [17:42:55] 10Analytics-Radar, 10WMDE-Templates-FocusArea, 10WMDE-TechWish-Sprint-2021-02-17: Backfill metrics for TemplateWizard and VisualEditor - https://phabricator.wikimedia.org/T274988 (10Milimetric) a:05Milimetric→03None oh my bad, I guess you're working on this. Let me know when you get to [[ https://wikite... [17:43:29] 10Analytics-Clusters, 10Product-Analytics: Can't re-run failed Oozie workflows in Hue/Hue-Next (as non-admin) - https://phabricator.wikimedia.org/T275212 (10razzi) a:03razzi Let me take a look at this configuration. [17:46:09] 10Analytics, 10Analytics-Wikistats: Wikistats Bug - https://phabricator.wikimedia.org/T275466 (10razzi) p:05Triage→03Low Thanks for your comments @Liz, this would definitely classify as a feature request! We likely won't get around to it for some time. [17:47:13] 10Analytics, 10Analytics-Wikistats: Split wikistats metrics out by namespace - https://phabricator.wikimedia.org/T275466 (10razzi) [17:48:34] 10Analytics-Clusters, 10Patch-For-Review: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (10razzi) Yeah, let's focus on deploying superset and worry about a superset-next domain at a later time. 
[17:49:12] 10Analytics-Clusters, 10Patch-For-Review: Add 6 worker nodes to the HDFS Namenode config of the Analytics Hadoop cluster - https://phabricator.wikimedia.org/T275767 (10razzi) [17:52:11] 10Analytics, 10Documentation: Wikimedia history dump - undocumented "merge" event - https://phabricator.wikimedia.org/T276119 (10razzi) a:03Milimetric [17:52:42] 10Analytics, 10Documentation: Wikimedia history dump - undocumented "create-page" event - https://phabricator.wikimedia.org/T276120 (10razzi) a:03Milimetric [17:53:34] 10Analytics-Radar, 10SRE, 10ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10razzi) [17:54:27] 10Analytics: SLF4J logspam when using hadoop command-line clients - https://phabricator.wikimedia.org/T276240 (10razzi) p:05Triage→03Low [17:55:57] (03PS1) 10Nray: Add new analytics/skin_change schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/668529 (https://phabricator.wikimedia.org/T261842) [17:58:06] 10Analytics: Odd behavior in unique device counts - https://phabricator.wikimedia.org/T276472 (10razzi) a:03Milimetric [18:06:42] (03CR) 10Legoktm: [C: 04-1] "You also need to set the compat level in debian/control, probably Depends: debhelper-compat (=12) (see https://nthykier.wordpress.com/2019" (032 comments) [analytics/udplog] - 10https://gerrit.wikimedia.org/r/668451 (owner: 10Majavah) [18:13:58] looks like I'm too late for end of grooming [18:14:01] sorry folks :( [18:14:56] 10Analytics-Clusters, 10DBA, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10razzi) Alright, patches are ready, here are the steps I will run to rename and reimage labsdb1012 to clouddb1021. #### Phase 1: Reimage, rename, set to inset... [18:20:59] me too... 
sorry [18:24:09] 10Analytics-Clusters, 10DBA, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10elukey) > VLAN Type: Analytics (could somebody confirm this for me?) The current labsdb1012 seems to be in the `cloud-support1-a-eqiad` VLAN, I'd leave it l... [18:28:50] 10Analytics, 10Documentation: Wikimedia history dump - undocumented "merge" event - https://phabricator.wikimedia.org/T276119 (10Milimetric) 05Open→03Resolved Thank you for flagging! Updated docs to explain: https://wikitech.wikimedia.org/w/index.php?title=Analytics%2FData_Lake%2FEdits%2FMediawiki_history... [18:28:56] 10Analytics, 10Documentation: Wikimedia history dump - undocumented "create-page" event - https://phabricator.wikimedia.org/T276120 (10Milimetric) 05Open→03Resolved Thank you for flagging! Updated docs to explain: https://wikitech.wikimedia.org/w/index.php?title=Analytics%2FData_Lake%2FEdits%2FMediawiki_h... [18:32:29] 10Analytics-Clusters, 10DBA, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10elukey) @razzi qq - are you going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/663865 before starting? Also in the plan I'd mention to down... [18:32:32] razzi: --^ [18:41:11] elukey: volans said to decommission then merge the patch; I'll add that to the steps [18:44:24] (03PS2) 10Majavah: Update to Buster, dh, debsrc 3.0 [analytics/udplog] - 10https://gerrit.wikimedia.org/r/668451 [18:45:17] razzi: the other bit that I am not clear about is the partman recipe for debian install, I don't see clouddb mentioned in netboot.cfg [18:45:37] the problem with that is that you'll not be able to reimage if not present [18:46:06] (03CR) 10Majavah: "Done." 
(032 comments) [analytics/udplog] - 10https://gerrit.wikimedia.org/r/668451 (owner: 10Majavah) [18:47:15] ahh see https://phabricator.wikimedia.org/T260441 [18:47:49] and https://gerrit.wikimedia.org/r/c/operations/puppet/+/620529/2/modules/install_server/files/autoinstall/netboot.cfg [18:47:52] razzi: --^ [18:48:09] so in your patch with role::insetup we'll need something like that [18:48:17] but better to double check with data persistence [18:54:07] (03PS9) 10Mforns: Add oozie job for session length computation [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664885 (https://phabricator.wikimedia.org/T273116) [18:55:28] (03CR) 10Mforns: Add oozie job for session length computation (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664885 (https://phabricator.wikimedia.org/T273116) (owner: 10Mforns) [18:56:43] (03PS3) 10Majavah: Update to Buster, dh, debsrc 3.0 [analytics/udplog] - 10https://gerrit.wikimedia.org/r/668451 [18:59:26] 10Analytics-Clusters, 10DBA, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10elukey) @marostegui I see in https://phabricator.wikimedia.org/T260441 that you handled the other hosts, should we just use the `db.cfg` partman config in htt... [18:59:41] razzi: I am logging out in a bit, I left some questions in the task [19:00:04] ok elukey, thanks for your comments so far, I think we're close here, but no rush [19:00:25] in theory we could proceed with the db.cfg, and then just reimage again in case needed (it will take really no time) [19:00:32] when do you want to do it? [19:03:08] elukey: would it be ok to do tomorrow morning while we're both around for a few hours? [19:03:16] razzi: sure! [19:03:23] 10Analytics-Clusters, 10DBA, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) Yes, that one should be fine. 
It will nuke everything but won't touch the raid level or anything else. Even if you don't use a recipe, that should... [19:03:26] even if it breaks it is fine [19:04:02] cool, a benefit of decommissioning :) [19:05:02] super [19:05:05] * elukey afk! [19:05:10] have a good rest of the day folks :) [19:33:06] 10Analytics, 10Product-Infrastructure-Team-Backlog, 10Wikimedia Taiwan, 10Chinese-Sites, 10Pageviews-Anomaly: Top read is showing one page that had fake traffic in zhwiki - https://phabricator.wikimedia.org/T274605 (10JAllemandou) Hi @Htchien - Thanks a lot for picking this up :) [19:46:35] * razzi out for lunch [19:48:46] hey a-team: I've got a couple of questions about the secondary schema repository, as I'm helping the SD team with some patches to their schemas. Who's got +2 rights on that repo, or how do I find that out? And secondly, how are merges done? [19:49:17] (03CR) 10DannyS712: [C: 03+1] "> Patch Set 1:" [analytics/aggregator] - 10https://gerrit.wikimedia.org/r/668412 (https://phabricator.wikimedia.org/T201491) (owner: 10Sahilgrewalhere) [19:50:27] (03PS1) 10Jhernandez: POC: Using a datalist to show the dbs [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/668544 [19:51:30] (03CR) 10Jhernandez: [C: 04-2] "Do not merge" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/668544 (owner: 10Jhernandez) [19:52:49] (03CR) 10Jhernandez: "I made a POC of using datalist for the DB field for autocomplete, this is what I meant in chat: https://gerrit.wikimedia.org/r/c/analytics" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [19:54:00] (03CR) 10Jhernandez: [C: 04-2] POC: Using a datalist to show the dbs (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/668544 (owner: 10Jhernandez) [20:17:45] hi Nettrom! anybody can get +2 on that repo, as far as i'm concerned!
[20:18:03] the reason we have a secondary repo is so we could have more flexible git merge rights [20:18:25] I'd defer to jason and team on that [20:18:48] it'd probably be good to add them for reviews, especially for new instrumentation, since they are trying to standardize the way that's done [20:19:01] feel free to add me as well, but don't block on me for a review [20:19:34] Better Use of Data folks are asking these questions a lot too...what is the process now for schema/instrumentation development, so it's not well defined yet i think [20:19:59] but +2 rights for that repo can and should be given out pretty freely, i think if you are a wmf engineer you should be able to get them [20:45:02] (03CR) 10Bstorm: POC: Using a datalist to show the dbs (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/668544 (owner: 10Jhernandez) [20:50:12] (03CR) 10Joal: "One last ask I forgot about in previous review - please excuse me :S" (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664885 (https://phabricator.wikimedia.org/T273116) (owner: 10Mforns) [20:50:50] mforns: let me know if my last comment is ok for you ---^ [20:51:12] mforns: I'm about to leave, and prefer to discuss it before if you wish [20:52:19] hey joal, just out of a meeting! [20:52:23] looking at the CR [20:52:25] ack [20:52:34] perfect timing :) [20:54:45] joal: makes sense to me! [20:55:50] thanks mforns :) I'll merge tomorrow :) [20:56:03] ok joal good night!! [20:56:11] thanks for all the good catches [21:00:02] (03PS10) 10Mforns: Add oozie job for session length computation [analytics/refinery] - 10https://gerrit.wikimedia.org/r/664885 (https://phabricator.wikimedia.org/T273116) [21:17:42] 10Analytics-Clusters, 10DBA, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10razzi) @elukey thanks for your comments; I edited the plan comment.
[21:18:34] 10Analytics, 10Privacy Engineering, 10Research, 10Patch-For-Review: Release dataset on top search engine referrers by country, device, and language - https://phabricator.wikimedia.org/T270140 (10Isaac) Status update: * Huge huge thanks to @bmansurov and @Milimetric for getting the code to a good place! * A... [21:19:42] !log rebalance kafka partitions for webrequest_upload partition 9 [21:19:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [22:45:19] 10Analytics, 10Patch-For-Review: Newpyter python spark kernels - https://phabricator.wikimedia.org/T272313 (10Ottomata) Actually, I like this fix better: https://gerrit.wikimedia.org/r/c/operations/debs/anaconda-wmf/+/668566 That will have to wait until I get to make a new anaconda-wmf release (SOON!)