[02:45:58] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:29:41] Analytics, Analytics-Kanban: Make anomaly detection correctly handle holes in time-series - https://phabricator.wikimedia.org/T251542 (Nuria) >Logs display an error about one of the basic hive classes not being available: Seems that if the series does not have 24 points (for say, a daily measure of hour...
[05:33:34] !log restart hadoop yarn nodemanager on analytics1071
[05:33:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[05:33:45] failed due to spark shuffle --^
[05:33:50] (and heap oom)
[05:34:30] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:35:45] !log re-run two failed hours for webrequest load text (07/05T05) and upload (06/05T23)
[05:35:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[05:38:42] elukey: was the info about spark shuffle in the logs on 1071 on var/log?
[05:39:30] nuria: hola! yes, /var/log/hadoop-yarn/hadoop-yarn-nodemanger..etc..
[05:39:58] elukey: and can i see those with sudo -u hdfs?
[05:39:59] it is sadly something that we have been seeing recently, I am going to increase the yarn node manager heap size today
[05:40:56] nuria: even from your user, they should be readable from all
[05:42:23] elukey: ah yes, i was looking at syslogs
[05:42:34] elukey: got it
[05:46:10] !log re-run mediarequest-hourly-wf-2020-5-6-19
[05:46:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:01:31] mmm a lot of jobs failing for java.lang.NoClassDefFoundError: org/apache/hive/service/cli/HiveSQLException
[06:01:35] including mw history
[06:03:41] all spark jobs
[06:06:02] and they started after the deploy
[06:21:31] ah no lovely also webrequest fails
[06:38:07] elukey: I'm here if you need me to look at anything
[06:40:48] fdans: hola!
[06:40:58] I am not sure what's happening, there is a weird hive error
[06:41:10] it seems as if the hive-service.jar wasn't picked up by oozie or similar
[06:41:44] let me try with the hammer
[06:41:50] !log restart oozie on an-coord1001
[06:41:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:43:31] re-running a job to see if anything changes
[06:43:52] yesterday I used oozie admin shlib upgrade and I just want to make sure that oozie isn't in a weird state
[06:51:51] lol it completed fine
[06:52:02] trying with webrequest
[06:54:06] Analytics: Creation of canonical pageview dumps for users to download - https://phabricator.wikimedia.org/T251777 (fdans) Some more things to consider after team discussion: - For this new dump to replace the Pageviews dump we would have to provide not only the access method, but also the agent type dimen...
[07:06:07] !log restart mediawiki-history-load via hue
[07:06:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:08:13] and of course the cluster is super busy
[07:12:15] Nathan is using ~25% of the total RAM with a spark job :)
[07:18:45] !log execute yarn application -movetoqueue application_1583418280867_332862 -queue root.nice
[07:18:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:18:48] let's see if it works
[07:28:06] ebernhardson: o/ I was checking https://yarn.wikimedia.org/proxy/application_1583418280867_333560/mapreduce/conf/job_1583418280867_333560
[07:28:16] it consumes ~25% of the total ram available
[07:28:22] (in the cluster)
[07:28:35] it is in the root.nice queue but expected to be so massive?
[07:32:11] !log re-run mediawiki history load
[07:32:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:38:52] dcausse: o/
[07:38:58] o/
[07:39:11] goood morning :)
[07:39:16] morning! :)
[07:39:22] so there is a big job running by analytics-search
[07:39:22] https://yarn.wikimedia.org/proxy/application_1583418280867_333560/mapreduce/conf/job_1583418280867_333560
[07:39:35] that is now consuming ~ half of the total ram of the cluster :D
[07:39:51] if you look for "hive.query.string"
[07:39:54] you can see the full query
[07:40:03] (there is a search bot in the top right corner)
[07:40:11] SELECT '2020-05-06' AS date, search_classify(uri_path, uri_query) AS api, referer_class, COUNT(1) AS calls FROM webrequest WHERE etc..
[07:40:40] interesting
[07:41:29] looks like a query for populating our dashboards at https://discovery.wmflabs.org/
[07:41:38] which are broken since months
[07:42:43] I'd be tempted to kill the job and see if we can tune it
[07:42:54] elukey: yes it's fine to kill
[07:43:02] currently allocated MBs: 1933312
[07:43:17] that is ~2TB :D
[07:43:20] ack then thanks!
[07:43:35] we might just want to stop generating this data
[07:43:43] I'll bring this up
[07:43:46] !log kill application_1583418280867_333560 after a chat with David, the job is consuming ~2TB of RAM
[07:43:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:43:49] dcausse: <3
[07:44:28] done, half of the cluster free now :D
[07:45:18] !log re-run mediawiki-history-denormalize
[07:45:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:56:16] ok the cluster is busy but manageable
[07:56:44] * elukey coffee, brutal start of the morning :D
[08:14:58] Analytics, I18n, RTL: Support right-to-left languages in Wikistats - https://phabricator.wikimedia.org/T251376 (Amire80)
[09:37:52] elukey: I cannot kinit on stat1005 anymore but it works on stat1007. I get "kinit: Failed to store credentials: No credentials cache found". any idea what could be the reason?
[09:39:17] mgerlach: hey! Sorry I am working on it, I thought I was alone, I am changing some settings and I made a mistake. One min and it should be fixed
[09:40:38] elukey: thanks
[09:43:15] mgerlach: ok so you should log off and login again
[09:43:17] it should work
[09:43:39] one thing - I am trying to change the default location of the kerberos credential cache
[09:43:43] so something might not work
[09:44:07] for example, if possible, I'd ask you to stop completely your notebook and start it again
[09:44:11] elukey: checked - it works. thanks again.
[10:03:44] mgerlach: let me know if anything doesn't work, you are now a tester of the new settings :D
[10:04:14] elukey so far so good ; )
[10:05:57] mgerlach: I am having issues with pyspark in notebooks, so you'll probably see some as well :(
[10:07:17] for me pyspark+notebooks actually works
[10:09:02] mgerlach: did you start a new one or kept using the old one?
[10:09:05] a running one sorry
[10:09:15] because it might be still using the old credentials
[10:10:51] probably the old one. but it asked for my credentials at some point and failed (thats when I pinged you); after re-entering it worked again
[10:39:58] * elukey lunch!
[10:52:40] Hi folks- Wow a lot of errors :(
[10:58:20] !log Rerun wikidata-articleplaceholder_metrics-wf-2020-5-6
[10:58:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:00:41] !log Moving application_1583418280867_334532 to the nice queue
[11:00:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:14:55] Analytics, Analytics-Kanban: Make anomaly detection correctly handle holes in time-series - https://phabricator.wikimedia.org/T251542 (JAllemandou) I think we should fill holes with 0s (that's actually the meaning of the hole).
[11:17:42] (CR) Joal: "We need to discuss how we want to handle data-dependency here. I don't think we have a datasets file for events yet. Should we have one? S" [analytics/refinery] - https://gerrit.wikimedia.org/r/594719 (https://phabricator.wikimedia.org/T249773) (owner: Milimetric)
[11:33:30] Analytics, DC-Ops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (Cmjohnson)
[11:51:41] Analytics, DC-Ops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-druid1001.eqiad.wmnet'...
[11:52:21] Analytics, DC-Ops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-druid1002.eqiad.wmnet ` The log...
[12:05:15] Analytics, DC-Ops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-druid1001.eqiad.wmnet ` The log...
[12:08:40] Analytics, DC-Ops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-druid1002.eqiad.wmnet'...
[12:17:25] Analytics, DC-Ops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-druid1001.eqiad.wmnet'...
[12:27:06] so bad news from /run/user/etc..
[12:27:22] jupyter doesn't support it by default, tried to pass the env var around but needs a bit more testing
[12:27:31] I fear that this issue will repeat with multiple tools
[12:27:32] sigh
[12:27:58] mgerlach: I am reverting my change on stat1005, you might need to kinit again etc..
sorry :(
[12:28:20] * joal sends a lot of love to elukey, and some meat to feed the 3-headed dog :S
[12:31:55] yeah it seems that everything is a problem if outside /tmp/krb
[12:31:56] Analytics, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (Cmjohnson)
[12:32:24] fdans: when you have a moment I'd like to ask you a question about the geoip archive stuff
[12:32:43] elukey: batcave?
[12:33:32] fdans: here is fine, should be quick :)
[12:33:40] elukey: fire away!
[12:34:01] so I am trying to complete the refactoring of the roles for the stat boxes
[12:34:09] last thing standing is the profile for geoip archive
[12:34:28] I'd love to move it on an-launcher, since IIRC that thing pushes to hdfs
[12:34:37] so there is no real reason to have it on stat1007 right?
[12:44:01] Analytics, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-druid1001....
[12:46:41] elukey: yeah, as long as it can fetch the newest version of the db without issues there should be no problem
[12:49:27] super thanks :)
[13:01:21] Analytics, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-druid1002....
[13:03:29] joal: is 00/4:45:00 on purpose?
[13:04:07] elukey: different times of days?
[13:04:20] elukey: I don't master those interval things :S
[13:05:17] so that should be hour:minute:second
[13:05:33] and usually the /something are to execute every something time
[13:05:45] I think that the above means every 4 hours
[13:05:47] or similar
[13:05:57] what is the target that you have in mind?
[13:06:12] elukey: exactly that - every 4 hour, but a different minute
[13:06:29] elukey: mimic of webrequest (hourly data)
[13:09:16] Analytics, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1001.eqiad.wmnet'] ` and were **ALL** successful.
[13:09:33] joal: okok perfect
[13:09:35] merging
[13:10:14] Analytics, DC-Ops, Operations, ops-eqiad, Patch-For-Review: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` druid1007.eqi...
[13:10:41] elukey: Many thanks :) I would be happy to change if you prefer!
[13:12:14] Analytics, DC-Ops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` druid1008.eqiad.wmnet ` The log can...
[13:15:33] joal: nono just wanted to check if it was the intended interval, we can change it in the future in case
[13:15:45] Ack elukey - T
[13:16:05] elukey: thanks for taking the time - Let me know if I can help with anything :S
[13:16:15] sure!
[13:16:19] I am deploying now
[13:27:35] Analytics, DC-Ops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1002.eqiad.wmnet'] ` and were **ALL** successful.
[13:33:11] (CR) Ottomata: "Hm. Refine will write the _REFINED flag, and generally that will mean things are ready to go. Refine does wait 2 hours before attempting" [analytics/refinery] - https://gerrit.wikimedia.org/r/594719 (https://phabricator.wikimedia.org/T249773) (owner: Milimetric)
[13:35:27] ottomata: morningggg
[13:35:35] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/594941/ - green light?? :D
[13:35:43] (stat1007 to role explorer!)
[13:36:13] Analytics, DC-Ops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['druid1007.eqiad.wmnet'] ` and were **ALL** successful.
[13:36:32] WOWO
[13:36:34] go for it luca
[13:36:35] is that the last one?
[13:36:42] or are there still some conditionals floating around?
[13:37:55] luca how about greenlight for
[13:37:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/594565
[13:37:56] ?
[13:37:57] :
[13:37:58] :)
[13:38:14] Right now it will only start importing test.event and some eventgate error topics
[13:38:17] ottomata: so there are some conditionals around that I had to make, but I have some ideas about how to fix them :)
[13:38:27] cooo
[13:38:52] Analytics, DC-Ops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['druid1008.eqiad.wmnet'] ` and were **ALL** successful.
[13:39:56] ottomata: I completely lost the dynamic stuff
[13:40:03] hah
[13:40:30] elukey:
[13:40:31] https://gerrit.wikimedia.org/r/c/analytics/refinery/+/593047
[13:41:07] it makes the camus wrapper request from meta.wikimedia.org/w/api.php?action=streamconfigs... when it is run
[13:41:09] and use that to set
[13:41:13] -Dkafka.whitelist.topics
[13:41:16] ahh you extended the camus wrapper! okok now it makes more sense :D
[13:41:28] yesyes looks good
[13:41:36] a bit magical but it'll do
[13:41:44] until we figure out how to clean the entire thing up
[13:41:55] doesn't make things better, but also not worse ¯\_(ツ)_/¯
[13:42:20] k merging that will make sure it works ok
[13:47:02] Gone buying food - hopefully back on time for standup
[13:49:45] wow this old laptop's keyboard is much easier to type on than my newer one
[13:50:19] ottomata: one thing that I forgot yesterday - is it ok if I bump the yarn nm's heap to 6G (and reduce the max mem available for containers accordingly) ?
[13:50:30] sure!
[13:50:33] there seems to be a OOM issue sometimes with spark shuffling
[13:50:39] and the nm gets to OOM :(
[13:50:46] hm, does sounds worth it but maybe risky too?
[13:50:56] what would we be reducing the container max mem to?
[13:51:09] reducing it by 2G
[13:51:18] basically the diff between 4G->6G
[13:51:33] so -2G
[13:51:55] that means that jobs won't be able to request more than 4G per worker, right?
[13:52:01] maybe fine, but also maybe some jobs need that?
[13:52:13] worth a try, might want to check with joseph about that too
[13:52:32] mmm wait a sec, I am not getting why
[13:52:40] I mean the 4G limit per worker
[13:53:06] !log move stat1007 to role::statistics::explorer (adding jupyterhub)
[13:53:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:53:12] the spark executor runs in a container, no?
[13:53:18] and also other mappers et.c
[13:53:30] you'd be reducing the max mem per container from 6G to 4G?
[13:53:31] right?
[13:53:43] (brb)
[13:54:39] ottomata: ahhh nono sorry, what I want to do is raise the NM heap max size to 6G
[13:54:59] and reduce yarn_scheduler_maximum_allocation_mb: 53248 to 53248 - 2G
[13:56:31] but now that I think about it, we have different nodes, that is not the value
[13:57:36] b
[14:00:10] right but by reducing yarn_scheduler_maximum_allocation_mb, you will be reducing the memory available to any given e.g. map task, right?
[14:00:30] so, if there is currently some map task that needs 53248 to work, it will no longer be able to get that much
[14:00:32] and might fail
[14:01:12] like, what happens if someone runs spark with --executor-memory 53248M
[14:01:13] ?
[14:01:33] we can argue that having a worker with 50G of ram is a little bit too much :D
[14:01:46] OH THAT IS G??
[14:02:01] right.
[14:02:01] ok
[14:02:03] np
[14:02:20] sounds fine.
[14:02:24] hah
[14:02:26] yes proceed!
[14:02:36] but I am not remembering very well all those calculations about space etc.. so I'll recheck, what I want to do is bump the NM heap size :D
[14:02:47] hey teaaaammmm, more alarrrmmmms!
[14:05:36] elukey: sounds good
[14:05:47] sorry I misunderstood for a sec what you were reducing
[14:05:53] +1 to your change, should be fine
[14:11:52] Analytics, Analytics-Kanban: Make anomaly detection correctly handle holes in time-series - https://phabricator.wikimedia.org/T251542 (mforns) @Nuria I agree with @JAllemandou. I think we should fill in a default value, so that we can still use the existing data and alert accordingly. Usually 0 is a good...
[14:13:17] aaaand stat1007 done!
[14:13:31] still not perfect but happy about the resul
[14:13:35] *result
[14:18:00] :D
[14:21:39] Analytics, DC-Ops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (Cmjohnson)
[14:21:54] Analytics, DC-Ops, Operations, ops-eqiad: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (Cmjohnson) Open→Resolved
[14:53:19] Yay! Back in time :)
[14:53:29] shall we do standup or staff-meeting team?
[14:54:38] Analytics: Decomission notebook hosts - https://phabricator.wikimedia.org/T249752 (elukey) Stalled→Open
[14:54:40] Analytics, Analytics-Kanban: Unify puppet roles for stat and notebook hosts - https://phabricator.wikimedia.org/T243934 (elukey)
[14:59:31] a-team: standup or staff meeting (same q as joal)
[15:00:58] ok - staff meeting :)
[15:01:24] oh i guess staff ya
[15:31:52] Analytics, Analytics-Cluster: Monitoring GPU Usage on stat Machines - https://phabricator.wikimedia.org/T251938 (elukey) @Aroraakhil there are two ways of checking metrics: 1) `sudo radeontop` 2) https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu?orgId=1 rocm-smi is unfortunately a python script th...
[15:33:21] Analytics, good first task: Javascript-less Wikistats - https://phabricator.wikimedia.org/T251979 (Milimetric) p:Triage→Medium Cool would be some kind of Server Side Rendering (even snapshot) :) Which I'm for. Let's do it.
[15:34:47] Analytics, Analytics-Cluster: Monitoring GPU Usage on stat Machines - https://phabricator.wikimedia.org/T251938 (Milimetric) Open→Resolved a:Milimetric
[15:35:22] Analytics, Analytics-Cluster: Monitoring GPU Usage on stat Machines - https://phabricator.wikimedia.org/T251938 (Milimetric) Documented on wikitech https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU
[15:38:34] Analytics, Analytics-Kanban: Tune up thresholds of data quality hourly alarms - https://phabricator.wikimedia.org/T251814 (Milimetric) p:Triage→High
[15:39:50] Analytics, Analytics-Cluster, Analytics-Wikistats: Add proper trend numbers to wikistats metrics - https://phabricator.wikimedia.org/T251813 (Milimetric) p:Triage→Medium
[15:40:12] Analytics, Analytics-EventLogging, MediaWiki-Vagrant: eventlogging vagrant role: 'ParsedRequirement' object has no attribute 'req' - https://phabricator.wikimedia.org/T251864 (Milimetric) Open→Declined We're putting most effort on MEP and the new flow. This is the python side of EventLogging...
[15:40:49] Analytics, Analytics-Kanban: Troubleshoot EventLogging sanitization immediate - https://phabricator.wikimedia.org/T251794 (Milimetric) p:Triage→High
[15:42:19] Analytics, Analytics-Kanban: Add folder creation for sqoop initial installation in puppet - https://phabricator.wikimedia.org/T251788 (Milimetric) p:Triage→High a:fdans Debate: could be the script or puppet that creates the folder.
[15:44:49] Analytics: Cannot see SQL lab tab on UI - https://phabricator.wikimedia.org/T251787 (Milimetric) Open→Resolved p:Triage→High a:elukey The superset admin user bug thing came up again. Luca re-fixed it.
[15:45:52] joal: ouch https://yarn.wikimedia.org/cluster/app/application_1583418280867_333836 :(
[15:46:00] Analytics: Creation of canonical pageview dumps for users to download - https://phabricator.wikimedia.org/T251777 (Milimetric) p:Triage→High
[15:46:26] hm
[15:46:38] elukey: too much pressure on cluster I think :(
[15:46:39] after 8h /o\
[15:46:53] elukey: at the end :)
[15:47:03] elukey: will look into logs to make sure
[15:47:15] elukey: Thanks for the heads up!
[15:48:26] also Cc: mforns, denormalize failed :( - see yarn link above
[15:48:36] ok
[15:48:40] elukey: shuffle errors
[15:48:48] plenty of them all along
[15:48:57] buuuu
[15:49:02] Analytics, Analytics-EventLogging, Product-Analytics: EditAttemptStep sent event with "ready_timing": -18446744073709543000 - https://phabricator.wikimedia.org/T251772 (Milimetric) +1 @mpopov, the client should never send negative numbers, no matter what the browsers are telling you :) I'd come up...
[15:49:02] but what kind of errors? OOM?
[15:49:04] elukey: can we release the patch and roll-restart before restarting the job?
[15:49:10] sure sure
[15:50:06] elukey: from analytics1047.eqiad.wmnet (at least some)
[15:50:40] Analytics, Analytics-Kanban: Add page_restrictions table to sqoop list - https://phabricator.wikimedia.org/T251749 (Milimetric) a:JAllemandou
[15:51:46] Analytics, Analytics-Kanban: check leftovers of jmorgan - https://phabricator.wikimedia.org/T251600 (Milimetric) a:elukey with honorable mention of @mforns to do any cleanup in HDFS
[15:52:19] Analytics, Analytics-Kanban, Operations, Traffic, Patch-For-Review: Remove North Korea from data quality traffic entropy reports - https://phabricator.wikimedia.org/T251546 (Milimetric) p:Triage→High
[15:52:33] Analytics, Analytics-Kanban: Make anomaly detection correctly handle holes in time-series - https://phabricator.wikimedia.org/T251542 (Milimetric) p:Triage→High
[15:53:35] elukey: the error is java.lang.NullPointerException - this is unexpected :(
[15:54:54] Analytics, Analytics-EventLogging, Product-Analytics: EditAttemptStep sent event with "ready_timing": -18446744073709543000 - https://phabricator.wikimedia.org/T251772 (Ottomata) Ya I betcha you could add `minimum: 0` to the field. https://json-schema.org/understanding-json-schema/reference/numeric...
[15:56:20] Analytics, I18n, RTL: Support right-to-left languages in Wikistats - https://phabricator.wikimedia.org/T251376 (Milimetric) p:Triage→Medium I have thoughts about collaborating on this with the more mainstream effort of finding/building a design system for mediawiki (part of the slow migration...
[15:57:54] Analytics, I18n, RTL, good first task: Support right-to-left languages in Wikistats - https://phabricator.wikimedia.org/T251376 (Milimetric)
[15:59:30] Analytics: Change Wikistats UI language without reloading the page - https://phabricator.wikimedia.org/T251375 (Milimetric) p:Triage→High
[16:01:13] Analytics, Better Use Of Data, Product-Analytics: Augment Hive event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320 (Milimetric) p:Triage→High
[16:01:18] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform: All EventGate instances should use EventStreamConfig - https://phabricator.wikimedia.org/T251935 (Ottomata) Open→Declined After discussion the Event Platform Engineering sync yesterday, we all agreed that this is hard an...
[16:01:21] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, Patch-For-Review: Automate ingestion and refinement into Hive of event data from Kafka - https://phabricator.wikimedia.org/T251609 (Ottomata)
[16:01:23] Analytics, Analytics-EventLogging, Event-Platform, CPT Initiatives (Modern Event Platform (TEC2)), and 2 others: Refactor EventBus mediawiki configuration - https://phabricator.wikimedia.org/T229863 (Ottomata)
[16:01:55] something feels wrong - after 8h the job should be almost finished (or well advanced - And it seems not so advanced :(
[16:02:00] Analytics, Analytics-Kanban, Better Use Of Data, Event-Platform, Product-Analytics: Augment Hive event data with normalized host info from meta.domain - https://phabricator.wikimedia.org/T251320 (Ottomata)
[16:02:36] Analytics, Analytics-Kanban: Corrupted parquet statistics when querying webrequest data via Superset/Presto - https://phabricator.wikimedia.org/T251231 (Milimetric) a:elukey
[16:02:55] Analytics, Analytics-Kanban: Corrupted parquet statistics when querying webrequest data via Superset/Presto - 
https://phabricator.wikimedia.org/T251231 (Milimetric) p:Triage→High a:elukey→Nuria
[16:04:21] Analytics, Fundraising-Analysis: Statistics on a CN banner - https://phabricator.wikimedia.org/T251177 (Milimetric)
[16:05:13] Analytics: Decomission notebook hosts - https://phabricator.wikimedia.org/T249752 (Addshore) Will there be any automatic rsync / backup from the notebook hosts for all users? Or is that something I'll have to take care of myself?
[16:05:28] Analytics: Idea: Add 'top X bigger than Y' sanitization method to EL-to-Druid - https://phabricator.wikimedia.org/T251145 (Milimetric) Describe more. Do you mean per metric, dimension, etc?
[16:06:07] Analytics, Analytics-Kanban, Patch-For-Review: Create anaconda .deb package with stacked conda user envs - https://phabricator.wikimedia.org/T251006 (Milimetric) p:Triage→High
[16:06:17] Analytics: Decomission notebook hosts - https://phabricator.wikimedia.org/T249752 (elukey) >>! In T249752#6116532, @Addshore wrote: > Will there be any automatic rsync / backup from the notebook hosts for all users? > Or is that something I'll have to take care of myself? Managed by users since it requires...
[16:07:45] joal: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/594992/
[16:08:23] elukey: we need another change I think
[16:08:44] sure which one?
[16:09:37] the change from 53248 to 49152 is about max-allocation for a single container - We need to change the available memory by nodeManager I think (yarn_nodemanager_resource_memory_mb)
[16:09:53] Analytics, EventStreams: EventStreams socket stays connected without any traffic incoming - https://phabricator.wikimedia.org/T250912 (Milimetric) p:Triage→High Thanks for the report, can you please link us to the client code? This is a python bot? How often does it happen? Every time after X...
[16:09:58] joal: I thought the same but I didn't find any setting about it
[16:10:02] elukey: --^
[16:10:11] elukey: It's by node - maybe a different place?
[16:10:51] Analytics: Anomaly detection alarms for the edit event stream - https://phabricator.wikimedia.org/T250845 (Milimetric) p:Triage→High
[16:11:03] ah it maybe auto-calculated
[16:11:19] because we have different nodes
[16:11:26] this is why I didn't find it probably
[16:11:29] could be - I'm interested by the formula (or the place) :)
[16:12:15] Analytics: MEP: canary alarms so we know events are flowing through pipeline - https://phabricator.wikimedia.org/T250844 (Milimetric) p:Triage→Medium
[16:12:26] Analytics: MEP: canary events so we know events are flowing through pipeline - https://phabricator.wikimedia.org/T250844 (Milimetric)
[16:13:19] Analytics: Release a public dataset about percentage of referrers in wikipedia traffic - https://phabricator.wikimedia.org/T250840 (Milimetric) p:Triage→Low
[16:14:23] elukey: passed to hadoop.pp, undefined in (so defined somewhere else)
[16:14:52] I am confused, I don't find where it gets defined
[16:16:23] Analytics: Unique devices, retrofit with bot detection code - https://phabricator.wikimedia.org/T250744 (Milimetric) p:Triage→Medium Good to let the pageview detection bake for a bit before doing this.
[16:16:31] Analytics: We should get an alarm for partitions that have no data for topics that have data influx at all times, most of the mediawiki.* - https://phabricator.wikimedia.org/T250699 (Milimetric) p:Triage→High
[16:17:04] ahhhh
[16:17:06] we set yarn_nodemanager_os_reserved_memory_mb
[16:17:16] and we have
[16:17:17] $yarn_nodemanager_resource_memory_mb = $hadoop_config['yarn_nodemanager_os_reserved_memory_mb'] ?
{
[16:17:20] undef => undef,
[16:17:23] default => floor($facts['memorysize_mb']) - $hadoop_config['yarn_nodemanager_os_reserved_memory_mb'],
[16:17:26] }
[16:17:26] ok swapping completed :D
[16:17:28] yes yes
[16:17:39] Thanks elukey <3
[16:17:45] Where is it?
[16:18:18] Analytics: Verify if Superset can authenticate to Druid via TLS/Kerberos - https://phabricator.wikimedia.org/T250487 (Milimetric) p:Triage→Medium
[16:18:26] it is in hieradata/common.yaml
[16:18:30] Meh
[16:18:33] ok :)
[16:18:39] Analytics: Verify if Turnilo can pull data from Druid using Kerberos/TLS - https://phabricator.wikimedia.org/T250485 (Milimetric) p:Triage→Medium
[16:23:36] joal: ok updated
[16:23:45] I think the cluster business is at the gist of our problem - There have been retries all along the job
[16:24:13] reading elukey
[16:24:42] I am running pcc to see changes
[16:25:25] elukey: dumb idea while we are at it: can you add a comment line 714 to be able to more easily find yarn_nodemanager_resource_memory formula?
[16:25:32] ack elukey
[16:25:39] joal: how dare you asking such things
[16:25:41] :P
[16:25:46] elukey: :)
[16:26:05] Future-me is actually tapping my shoulder, that's why I ask
[16:26:47] I also changed the value in the wrong place (test cluster)
[16:27:30] Analytics, Analytics-Kanban, Product-Analytics: Technical contributors emerging communities metric definition, thick data - https://phabricator.wikimedia.org/T250284 (Milimetric) p:Triage→High
[16:27:35] Wow good catch elukey !
[16:27:41] Didn't notice elukey
[16:28:01] I looked at that in the previous patch and did not do it on that one (bad joal)
[16:28:23] Analytics: Kerberos-run-command doesn't work with spark-submit [workaround] - https://phabricator.wikimedia.org/T250161 (Milimetric) p:Triage→High
[16:29:39] heya elukey and joal can I help with anything?
[16:29:56] mforns: denormalize kaput :( [16:30:05] reading [16:30:06] joal: new version ready [16:30:15] ack elukey [16:31:18] +1ed elukey [16:31:40] pcc https://puppet-compiler.wmflabs.org/compiler1003/22398/ [16:31:44] I'm gonna monitor the job tonight [16:32:49] joal: ok merging, running puppet and roll restarting [16:32:52] ETA 10 mins [16:32:54] pfff-what a beginning of month [16:33:23] ack elukey - Thanks again a million - I'm gonna share a virtual beer with you tonight [16:33:51] <3 [16:35:05] elukey: That rsync-between-client link is a must have :) [16:37:22] !log roll restart of all the nodemanagers on the hadoop cluster to pick up new jvm settings [16:37:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:44:30] joal: https://grafana.wikimedia.org/d/000000585/hadoop?panelId=17&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-hadoop_cluster=analytics-hadoop&var-worker=All&from=now-3h&to=now [16:44:39] so 8g might be more than we need, we can tune it later [16:44:54] all done, when you want to restart denormalize please go [16:46:43] going afk to do some gardening, checking later! (ping me on the phone if anything explodes) [16:49:40] !log Restart and babysit mediawiki-history-denormalize-wf-2020-04 [16:49:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:59:28] Analytics, Analytics-Kanban: Make anomaly detection correctly handle holes in time-series - https://phabricator.wikimedia.org/T251542 (Nuria) >I think we should fill holes with 0s (that's actually the meaning of the hole). The danger of doing that is that you get a stream of data with zeros that indicate... [17:22:07] @analytics-team postpone start of research-analytics meeting for half hour to go to tech monthly? 
if not, i can start on time and watch tech monthly later [17:22:39] works for me isaacj [17:23:47] i'll take that for consensus unless i hear otherwise :) [17:26:29] * joal makes consensus with himself [17:29:02] the best kind! [17:34:28] (PS1) Ottomata: bin/camus - check_java_opts should override extra_java_opts [analytics/refinery] - https://gerrit.wikimedia.org/r/595007 (https://phabricator.wikimedia.org/T251609) [17:35:47] (CR) Ottomata: [V: +2 C: +2] "Merging this and deploying it to stop false positive camus failure report emails." [analytics/refinery] - https://gerrit.wikimedia.org/r/595007 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata) [17:36:04] Analytics, Analytics-Cluster: Monitoring GPU Usage on stat Machines - https://phabricator.wikimedia.org/T251938 (Aroraakhil) @elukey thanks much for your response. However, none of these monitoring tools give information about the pids of the processes or the number of processes currently using the GPU.... [17:39:12] !log deploying fix to refinery bin/camus CamusPartitionChecker when using dynamic stream configs [17:39:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:39:33] Analytics: Release a public dataset about percentage of referrers in wikipedia traffic - https://phabricator.wikimedia.org/T250840 (Nuria) [17:39:50] actually isaacj I'm gonna need to skip the meeting (family need) - Please reach out to me if needed! [17:40:31] joal: :thumbs up: [17:43:20] Analytics, Analytics-Cluster: Monitoring GPU Usage on stat Machines - https://phabricator.wikimedia.org/T251938 (elukey) @Aroraakhil this is the output of rocm-smi (I executed manually via sudo) by default is: ` elukey@stat1005:~$ sudo /opt/rocm/bin/rocm-smi ========================ROCm System Manage... [17:43:36] * elukey off! 
[17:52:15] Analytics, Analytics-EventLogging, MediaWiki-Vagrant: eventlogging vagrant role: 'ParsedRequirement' object has no attribute 'req' - https://phabricator.wikimedia.org/T251864 (Nuria) See: https://phabricator.wikimedia.org/T238230 [17:54:12] Analytics, Analytics-Cluster: Monitoring GPU Usage on stat Machines - https://phabricator.wikimedia.org/T251938 (Aroraakhil) @elukey thanks much for your prompt response. This is what I get from 'nvidia-smi' in our EPFL machine. As you can see it displays the number of processes currently running, and th... [18:28:43] Analytics: Add a "latest" partition to Hive tables - https://phabricator.wikimedia.org/T252148 (Isaac) [18:44:00] Analytics: Add a "latest" partition to Hive tables - https://phabricator.wikimedia.org/T252148 (Ottomata) I think this is a cool idea. @JAllemandou @elukey @Milimetric. What if every time we added a new Hive Partition, we'd select for the 'latest' one and then add another Hive partition pointing to its loc... [18:49:17] Analytics, Analytics-Kanban: check leftovers of jmorgan - https://phabricator.wikimedia.org/T251600 (mforns) [x] Removed listed directories from HDFS. [18:58:11] (PS1) Mforns: Add outreach.wikipedia and incubator.wikipedia to the pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/595028 [18:59:37] (CR) Mforns: [V: +2 C: +2] "Self-merging to avoid alarms" [analytics/refinery] - https://gerrit.wikimedia.org/r/595028 (owner: Mforns) [19:38:17] ottomata: o/ im curious, is there still a simple event logging / event bus -> sql thing? 
[19:38:30] (for use on other wikis) [19:40:48] addshore: no not really [19:41:14] eventlogging still works, but we aren't targeting supporting sql like support for third parties [19:41:24] the old stuff should all still work though [20:10:09] Ah team - forgot to mention - Tomorrow is off in France - I'll keep an eye on ops but will mostly be not available [20:27:45] (CR) Milimetric: "My sense is that it's too tricky to be very sure about streaming data. So you run with the best you have at the time, and when you get yo" [analytics/refinery] - https://gerrit.wikimedia.org/r/594719 (https://phabricator.wikimedia.org/T249773) (owner: Milimetric) [20:29:05] k joal :] [20:34:26] Analytics, Analytics-Kanban, Research, Patch-For-Review: Proposed adjustment to wmf.wikidata_item_page_link to better handle page moves - https://phabricator.wikimedia.org/T249773 (Milimetric) > One other thing that I thought of that might speed up the query: I can never remember how snapshot dat... [20:34:28] (CR) Ottomata: "Hm, I wouldn't call this 'streaming' data though, any more than webrequest is. The data is generated and imported here in pretty much exac" [analytics/refinery] - https://gerrit.wikimedia.org/r/594719 (https://phabricator.wikimedia.org/T249773) (owner: Milimetric) [21:17:16] (PS3) Milimetric: Use page move events to improve joining to entity [analytics/refinery/source] - https://gerrit.wikimedia.org/r/594428 (https://phabricator.wikimedia.org/T249773) [21:18:14] (CR) Milimetric: "done testing and vetting, everything looks good except the page_namespace column is now bigint for whatever reason so there's a type misma" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/594428 (https://phabricator.wikimedia.org/T249773) (owner: Milimetric) [21:18:53] byaaaaa [21:21:53] yay, finished vetting. Now to look at this sqoop... [21:22:03] (CR) Milimetric: "Hm... 
I don't know, I think this is just an artifact of how we're using the page move data right now. I think the idea here is to make it" [analytics/refinery] - https://gerrit.wikimedia.org/r/594719 (https://phabricator.wikimedia.org/T249773) (owner: Milimetric) [21:22:03] (CR) Milimetric: "In any case, the new job code is vetted, this can be properly reviewed now, and I'll do as you two want, no real preference from me." [analytics/refinery] - https://gerrit.wikimedia.org/r/594719 (https://phabricator.wikimedia.org/T249773) (owner: Milimetric) [21:33:29] Analytics, Analytics-Kanban, Research, Patch-For-Review: Proposed adjustment to wmf.wikidata_item_page_link to better handle page moves - https://phabricator.wikimedia.org/T249773 (Isaac) @Milimetric thanks for actually verifying my conjecturing around snapshot dates :) Based on the history for...