[03:40:49] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:34:23] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:04:11] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Create the new Hadoop test cluster - https://phabricator.wikimedia.org/T255139 (10elukey) [08:15:49] Good morning [08:16:33] bonjour [08:26:21] (03CR) 10Joal: [C: 03+2] "Tested - Merging for deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/629659 (https://phabricator.wikimedia.org/T263736) (owner: 10Joal) [08:31:43] (03Merged) 10jenkins-bot: Update MediawikiXMLDumpsConverter repartitioning [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/629659 (https://phabricator.wikimedia.org/T263736) (owner: 10Joal) [08:38:28] (03PS1) 10Joal: Bump jar version for mediawiki/wikitext jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/636606 (https://phabricator.wikimedia.org/T263736) [08:51:54] (03PS4) 10Conniecc1: Add dimensions to editors_daily dataset [analytics/refinery] - 10https://gerrit.wikimedia.org/r/607361 (https://phabricator.wikimedia.org/T256050) [08:52:17] (03CR) 10Ayounsi: [C: 03+1] Add Refine transform function for Netflow data set (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: 10Mforns) [08:54:49] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Add dimensions to editors_daily dataset - https://phabricator.wikimedia.org/T256050 (10cchen) hi @JAllemandou, sorry for the delay. I thought my last edit was saved, but it's not published...i resubmitted the change. 
[09:01:14] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Add dimensions to editors_daily dataset - https://phabricator.wikimedia.org/T256050 (10JAllemandou) No problem @cchen - It's a shame if the patch stays stale while needed :) [09:06:51] (03CR) 10Joal: "Looks good! Can you please test and vet the data @Conniecc1? easiest would be to run a version of that query against current snapshots and" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/607361 (https://phabricator.wikimedia.org/T256050) (owner: 10Conniecc1) [09:06:56] (03CR) 10Joal: [C: 03+1] Add dimensions to editors_daily dataset [analytics/refinery] - 10https://gerrit.wikimedia.org/r/607361 (https://phabricator.wikimedia.org/T256050) (owner: 10Conniecc1) [09:46:23] PROBLEM - eventlogging Varnishkafka log producer on cp4032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [09:46:43] PROBLEM - statsv Varnishkafka log producer on cp4032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [09:46:53] PROBLEM - Webrequests Varnishkafka log producer on cp4032 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [09:46:54] checking --^ [09:48:39] RECOVERY - Webrequests Varnishkafka log producer on cp4032 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [09:49:30] Traffic working on the host --^ [09:49:43] ack [09:49:59] RECOVERY - eventlogging Varnishkafka log producer on cp4032 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [09:50:17] RECOVERY - statsv Varnishkafka log producer on cp4032 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [10:01:34] 10Analytics-Radar, 10Operations, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10MoritzMuehlenhoff) @gsingers We have three major types of NDA/MOU under which people get access to PII-sensitive data on our servers: * Everyone who's WMF staff has signed an NDA a... [10:14:43] Morning everyone! I have a massive migraine, I've taken some Ibuprofen and will rest a bit. Might be useful in the afternoon. [10:22:42] klausman: ouch ack, please rest :) [10:22:57] Aye aye, cap'n [11:36:56] * elukey lunch! [12:10:18] Wow TIL - https://maven.apache.org/plugins/maven-shade-plugin/examples/class-relocation.html [12:11:40] This reminds me of my C classes with linkage issues [13:04:51] 10Analytics, 10MediaWiki-Page-editing, 10Platform Engineering, 10Product-Analytics, 10User-DannyS712: EditPage save hooks pass an entire `EditPage` object - https://phabricator.wikimedia.org/T251588 (10Ottomata) Ping @nettrom_WMF and @nshahquinn-wmf as per [[ https://meta.wikimedia.org/wiki/Schema_talk:E... [13:18:22] hello people, mediawiki is about to be switched back to eqiad [13:18:54] hello oh boy [13:22:20] 10Analytics-Clusters, 10Patch-For-Review: Create a temporary hadoop backup cluster - https://phabricator.wikimedia.org/T260411 (10elukey) All the nodes (analytics1042 -> 1057) have new ext4 partitions for /var/lib/hadoop/data/$letter. Next steps: 1) Reimage all the nodes (keeping Debian Stretch) 2) Review htt... 
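The maven-shade-plugin page linked in the TIL above documents class relocation: when a fat jar must bundle a library whose version clashes with one already on the cluster classpath (the "linkage issues" joked about just after), the shade plugin rewrites the bundled classes' package names at build time so both versions can coexist. A minimal `pom.xml` stanza of the kind that page describes — the `pattern`/`shadedPattern` values here are illustrative, not taken from refinery-source:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- bytecode references to com.google.common.* inside the shaded
               jar are rewritten to org.shaded.guava.*, so the bundled Guava
               cannot collide with the cluster-provided one -->
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.shaded.guava</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```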
[13:22:24] 10Analytics-Clusters, 10Patch-For-Review: Create a temporary hadoop backup cluster - https://phabricator.wikimedia.org/T260411 (10elukey) p:05Triage→03High a:05elukey→03razzi [13:28:24] ottomata: o/ - about https://gerrit.wikimedia.org/r/c/operations/puppet/+/636493 - does it work if we run it manually? (with vk set to stdout not kafka, or varnishncsa) [13:28:38] elukey: tested with varnishncsa [13:28:51] perfect, just wanted to double check, looks fine :) [13:28:59] ema is reviewing too [13:29:05] going to wait til after switch to merge [13:29:35] yep yep [13:31:19] elukey: hellooooo, let's do the deletion real quick? [13:32:08] sure [13:33:41] elukey: I'm in bc [13:33:56] coming sorry [13:52:43] 10Analytics, 10Research, 10Research-collaborations: Performance Issues when running Spark/Hive jobs via Jupyter Notebooks - https://phabricator.wikimedia.org/T258612 (10elukey) 05Open→03Resolved a:03elukey Closing this task since there seems to be no more action item left, please re-open if needed. [14:08:47] (03PS1) 10Joal: Fix maxmind UDFs for hive 2.3.3 (bigtop) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/636662 (https://phabricator.wikimedia.org/T266322) [14:08:58] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Possible issue between Maxmind and Hive 2.x libs in Refinery source - https://phabricator.wikimedia.org/T266322 (10JAllemandou) Following the stackoverflow link pasted in the task I have: - tried different versions of maxmind geoip2 without success - tri... [14:09:17] elukey: blocker unblocking :) [14:13:26] nice! [14:14:07] seems super nice [14:18:40] going afk to run some errands, will be back for standup :) [14:25:57] 10Analytics, 10Analytics-Kanban: Check whether mediawiki production event data is equivalent to mediawiki-history data over a month - https://phabricator.wikimedia.org/T262261 (10JAllemandou) Final note. 
I have found 2 problems with `mediawiki-events` (over `simplewiki` only, using `mediawiki-history` as a bas... [14:33:38] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Privacy Engineering, and 4 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10Ottomata) https://gerrit.wikimedia.org/r/635304 has been merged and a new 1.1.0 version of mediawiki/client/erro... [15:35:05] I am back! [15:35:17] I thought standup was earlier, all good :) [15:40:20] 10Analytics, 10Analytics-Kanban, 10Event-Platform: eventgate-analytics-external occasionally seems to fail lookups of dynamic stream config from MW EventStreamConfig API - https://phabricator.wikimedia.org/T266573 (10Ottomata) [15:43:17] (03CR) 10Elukey: [C: 03+1] "LGTM!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/636662 (https://phabricator.wikimedia.org/T266322) (owner: 10Joal) [15:43:42] fdans: o/ [15:44:01] elukey: hello [15:44:06] if you have time later on can we check the contents of thorium's /srv?
[15:45:17] elukey: it's going to be a little hard for your timezone, I have meetings back to back until your 19:30 [15:45:51] ah okok, we can do tomorrow [15:46:41] a-team I have like 2.5 hours of meetings from now so I'm going to push the train to tomorrow, one hour before standup [15:46:49] 10Analytics-Kanban: Deprecate Python 2 software from the Analytics infrastructure - https://phabricator.wikimedia.org/T204734 (10elukey) a:05elukey→03None [15:47:02] ok fdans, sounds good [15:48:20] yep looks good, it gives us time to review https://gerrit.wikimedia.org/r/636662 [15:48:35] it would be great if we get the test cluster unblocked this week [15:54:26] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) [15:54:32] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Services (watching): Modern Event Platform: Schema Repositories - https://phabricator.wikimedia.org/T201063 (10Ottomata) 05Open→03Resolved [15:55:23] 10Analytics, 10Analytics-Wikistats: Reindex mediawiki_history_reduced with lookups - https://phabricator.wikimedia.org/T193650 (10Milimetric) a:05Milimetric→03None [15:56:32] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Services (watching): Modern Event Platform: Stream Intake Service - https://phabricator.wikimedia.org/T201068 (10Ottomata) 05Open→03Resolved [15:56:37] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) [15:57:35] 10Analytics, 10CirrusSearch, 10Discovery, 10Discovery-Search: Ingest cirrussearchrequest data into druid - https://phabricator.wikimedia.org/T218347 (10Ottomata) 05Open→03Declined I think we are not going to do this, right? Especially now that this data can be accessed via Presto in Superset. Declini...
[15:57:39] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Patch-For-Review: Make Refine use JSONSchemas of event data to support Map types and proper types for integers vs decimals - https://phabricator.wikimedia.org/T215442 (10Ottomata) [15:57:41] 10Analytics: [HiveToDruid] Add support for ingesting subfields of map columns - https://phabricator.wikimedia.org/T208589 (10Ottomata) [15:57:58] 10Analytics: Check home/HDFS leftovers of rodolfovalentim - https://phabricator.wikimedia.org/T266467 (10elukey) ` ====== stat1004 ====== total 0 ls: cannot access '/var/userarchive/rodolfovalentim.tar.bz2': No such file or directory ====== stat1005 ====== total 184 drwxr-xr-x 2 24415 wikidev 4096 Jul 13 01:1... [15:59:34] 10Analytics: Check home/HDFS leftovers of leila - https://phabricator.wikimedia.org/T264994 (10elukey) 05Open→03Resolved a:03elukey Tables dropped by Francisco today, this task is completed :) @leila please check that everything looks good :) [15:59:44] 10Analytics, 10Event-Platform, 10Product-Analytics: Define how we vet code & data for ongoing, automated ingestion in Druid - https://phabricator.wikimedia.org/T210012 (10Ottomata) 05Open→03Declined We won't be automating ingesting event data into Druid now that it is queryable via Presto and Superset.... [15:59:51] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Better Use Of Data, and 6 others: Modern Event Platform: Schema Guidelines and Conventions - https://phabricator.wikimedia.org/T214093 (10Ottomata) [16:07:17] 10Analytics: Purchase of GPUs to help support the open source software stack on top of AMD GPUs (donation to Debian) - https://phabricator.wikimedia.org/T241192 (10elukey) 05Open→03Declined As far as I can tell this is not needed anymore, please reopen if necessary! [16:07:19] 10Analytics, 10Analytics-Kanban, 10User-Elukey: New Hadoop hardware.
Refreshes and hosts with space for GPUs - https://phabricator.wikimedia.org/T241190 (10elukey) [16:10:28] (03CR) 10Joal: "I also confirm this doesn't break the prod cluster" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/636662 (https://phabricator.wikimedia.org/T266322) (owner: 10Joal) [16:27:55] joal: sorry I can jumb back! [16:27:59] *jump [16:28:10] razzi: as I was falling asleep last night I think I realized why our curl --resolve wasn't doing what we wanted! [16:28:14] i haven't verified this [16:28:14] buut [16:28:17] elukey: talking with mforns - will ping you when I have time :) [16:28:32] the caches have two layers, a frontend and a backend [16:28:37] so the http req goes like [16:28:48] client -> cache frontend -> cache backend -> service [16:29:12] our --resolve was working to make the request go to cp1075 cache frontend [16:29:29] but then the req went to a random cache backend, which did not yet have the puppet change applied to it, and still used service :443 [16:29:48] only when the random cache backend ended up being :8443, did it do what we expected [16:32:18] :8443 was only configured on cp1075 [16:32:22] hm, why would a random cache backend be :8443, before we rolled it out to more than cp1075? [16:32:30] it wasn't [16:32:52] that's why we didn't get the request on 8443 in tcpdump on thorium [16:33:01] most of the time we were getting [16:33:05] mforns: link? [16:33:20] joal: https://meet.google.com/kti-iybt-ekv [16:33:33] client -> cp1075 -> (e.g.) cp1089 -> service:443 [16:33:34] thanks mforns [16:33:40] our --resolve made the req go to cp1057 frontend [16:33:44] cp1075* [16:33:50] the cache backend was random [16:33:57] only when it happened to also be cp1075 [16:33:57] like [16:34:15] client -> cp1075 -> cp1075 -> service:8443 [16:34:21] did it actually go to 8443 [16:34:48] ok, so a cace frontend can also be a cache backend?
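The routing insight just described — `curl --resolve` pins only the client's first hop to the cp1075 frontend, while the frontend then picks a backend on its own, and only cp1075's backend had the `:8443` change — can be sketched as a toy model. This is an illustration only: no network is involved, the hostnames other than cp1075/cp1089 are made up, and backend selection is round-robin here purely to keep the sketch deterministic (the real selection is effectively random from the client's point of view).

```shell
# Toy model of the two-hop cache routing (plain POSIX sh, no network).
# --resolve pins the request to the cp1075 *frontend*, but the frontend
# then chooses among all *backends*; only cp1075's backend routed to
# service:8443, so only ~1 request in 4 behaved as expected.
backends="cp1075 cp1089 cp1077 cp1079"   # cp1075/cp1089 from the log; the rest illustrative
total=400 hits=0 i=0
while [ "$i" -lt "$total" ]; do
    set -- $backends
    shift $(( i % $# ))            # frontend picks a backend (round-robin stand-in)
    if [ "$1" = "cp1075" ]; then   # only this backend had the :8443 puppet change
        hits=$((hits + 1))
    fi
    i=$((i + 1))
done
echo "requests that actually reached service:8443: $hits / $total"   # -> 100 / 400
```

With four backends, a quarter of the pinned-frontend requests happen to land on cp1075's own backend — which matches why the tcpdump on thorium only occasionally showed traffic on 8443.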
[16:34:53] *cache [16:34:58] yes, two different processes on the same nodes [16:35:15] https://wikitech.wikimedia.org/wiki/Caching_overview [16:35:41] and more confusingly, one is varnish (the frontend after TLS termination) and the other one is ATS (all backends are) [16:35:44] Was just going to look for that article :) [16:36:28] yeah and i was confused yesterday because i was remembering ATS was for frontends...but that is only for TLS termination [16:36:30] so REALLY [16:36:30] ;it is [16:37:13] https client -> ATS for TLS -> Varnish http frontend cache  -> ATS http backend cache -> https service [16:37:24] haha and in our case, that https service expands to [16:37:38] https envoy -> http apache [16:37:43] a crazy roller coaster [16:44:34] razzi: I have added some comments to the code reviews opened, lemme know if you want to talk about hadoop or if you are already set [16:46:29] elukey: Ready! [16:46:41] elukey: batcave is busy - see marcel's link above :) [16:48:00] joal: tardis? [16:49:55] elukey: yeah, let's talk hadoop when you get a moment [16:51:18] hola! 
everyone mforns : yes, finally I have been able to thoroughly test the changes for alarms, i figured out what was wron [16:51:21] *wrong [16:51:36] mforns: which, as expected, was some ridiculous thing [16:52:03] hey nuria :] [16:52:12] mforns: we can talk later, after meetings [16:52:41] nuria: we're done with common meetings [16:52:45] for today [16:52:47] mforns: ah ok [16:53:25] mforns: so issue was that the way the "updater" job works requires to override the tmp directory cause user is not taken into account when building it [16:53:33] mforns: so -Dtemp_directory='hdfs://analytics-hadoop/tmp/nuria/' \ [16:54:00] aaaaahhh [16:54:02] yes [16:54:12] mforns: it will default to /tmp/analytics [16:54:22] I see [16:54:26] mforns: unless overridden and thus things do not work [16:54:51] mforns: I will update docs on how to run job [16:54:55] mforns: [16:54:56] https://www.irccloud.com/pastebin/quKDZq4F/ [16:55:19] nuria: you could put it in the README [16:55:23] mforns: i can update spark job too to take user into account but whatever you think will work best [16:55:37] mforns: ya, i will update docs and put it on README [16:55:44] but nuria, I thought you're already on vacation? [16:55:55] mforns: yes, but this was killing me [16:56:00] xD [16:56:05] mforns: and i wanted to finish it [16:56:16] OK, but don't worry, I can continue from where you left it [16:56:36] https://usercontent.irccloud-cdn.com/file/tLINbHiX/Screen%20Shot%202020-10-27%20at%209.56.19%20AM.png [16:57:03] awesome! [16:57:08] these are the two timeseries [16:57:17] now i just need a suitable threshold [16:57:36] as you can see works great to identify issue [16:57:43] yes, it does! [16:58:14] mforns: will test thresholds today a bit more and commit final changeset [16:58:31] ok, great :] [16:58:42] mforns: the blogpost i will write as volunteer [17:03:36] nuria: OK, do you want to write it alone and then I review it, or pair-writing?
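The `-Dtemp_directory` workaround nuria pastes above hardcodes a username. A hypothetical sketch of the same fix with the user derived at invocation time, so two people running the updater job cannot collide on the shared `/tmp/analytics` default (the flag name is taken from the paste; the surrounding launcher command and the fallback value are assumptions, not the real job docs):

```shell
# Hypothetical sketch: build the per-user temp_directory override instead of
# hardcoding a name. The updater job defaults to the shared /tmp/analytics
# (the bug described above), so each run should override it.
TEMP_DIR="hdfs://analytics-hadoop/tmp/${USER:-analytics}/"
# would then be passed to the job invocation as:
echo "-Dtemp_directory='${TEMP_DIR}'"
```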
[17:07:42] 10Analytics, 10Analytics-Wikistats, 10Product-Analytics: Mysterious anonymous content page creations on English Wikipedia according to stats.wikimedia.org - https://phabricator.wikimedia.org/T266578 (10kaldari) [17:13:06] 10Analytics, 10Product-Analytics: Analyze differences between checksum-based and revert-tag based reverts in mediawiki_history - https://phabricator.wikimedia.org/T266374 (10LGoto) p:05Triage→03Medium a:03nettrom_WMF [17:16:38] 10Analytics, 10Analytics-Wikistats, 10Product-Analytics: Mysterious anonymous content page creations on English Wikipedia according to stats.wikimedia.org - https://phabricator.wikimedia.org/T266578 (10kzimmerman) For reference of scale, we get ~600 pages created a day, so the scale of this is low. Currently... [17:18:47] 10Analytics, 10MediaWiki-Page-editing, 10Platform Engineering, 10Product-Analytics, 10User-DannyS712: EditPage save hooks pass an entire `EditPage` object - https://phabricator.wikimedia.org/T251588 (10kzimmerman) Adding @MNeisler to make sure Editing is aware. [17:26:01] milimetric: you submitting puppet patch for revert of active dc checks? [17:26:09] yes [17:26:12] k [17:26:13] i will merge [17:27:45] k, it's up ottomata, take a look, the codfw|eqiad syntax shows up in other places, shall we refactor those too? I didn't look too closely [17:29:22] milimetric: the only one i see is for the test topic, and those will always have data [17:29:25] in both DCs [17:30:28] razzi: ok I am ready [17:31:04] elukey: k, 1 minute [17:31:12] 10Analytics, 10Analytics-Wikistats, 10Product-Analytics: Mysterious anonymous content page creations on English Wikipedia according to stats.wikimedia.org - https://phabricator.wikimedia.org/T266578 (10Ammarpad) If an anonymous user creates article in draft namespace and it's later moved (with all the histor...
[17:32:04] 10Analytics, 10Analytics-Wikistats, 10Product-Analytics: Mysterious anonymous content page creations on English Wikipedia according to stats.wikimedia.org - https://phabricator.wikimedia.org/T266578 (10kaldari) 05Open→03Resolved a:03kaldari @Ammarpad - Ah, that makes sense! Thanks for clearing up the m... [17:33:20] (03CR) 10Joal: [C: 03+2] "Merging for later deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/636662 (https://phabricator.wikimedia.org/T266322) (owner: 10Joal) [17:37:45] 10Analytics, 10Editing-team, 10MediaWiki-Page-editing, 10Platform Engineering, and 2 others: EditPage save hooks pass an entire `EditPage` object - https://phabricator.wikimedia.org/T251588 (10MNeisler) [17:38:00] !log restrict Fuzz Faster U Fool user agents from submitting eventlogging legacy systemd data - T266130 [17:38:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:38:03] T266130: Filter out EventLogging data with bunk user-agents - https://phabricator.wikimedia.org/T266130 [17:38:29] (03PS1) 10Joal: Bump hive-jar of webrequest-load job to v0.0.138 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/636721 (https://phabricator.wikimedia.org/T266322) [17:39:39] (03Merged) 10jenkins-bot: Fix maxmind UDFs for hive 2.3.3 (bigtop) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/636662 (https://phabricator.wikimedia.org/T266322) (owner: 10Joal) [17:43:07] 10Analytics, 10Editing-team, 10MediaWiki-Page-editing, 10Platform Engineering, and 2 others: EditPage save hooks pass an entire `EditPage` object - https://phabricator.wikimedia.org/T251588 (10MNeisler) Yes, confirming that [[ https://meta.wikimedia.org/wiki/Schema:EditAttemptStep | EditAttemptStep ]] is...
[18:28:36] mforns: we can co-write, thus far I only have the subtitle: "monitoring wikipedia's accessibility around the world" [18:29:03] mforns: https://docs.google.com/document/d/1w4VUujPdLt8NObzlPuHHp3MJ1eUfOKEl3OT3gXnPC9Y/edit [18:30:18] PROBLEM - Disk space on Hadoop worker on an-worker1101 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/s 14 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:30:34] mmmmmmm [18:30:57] so this is one of the workers with gpus [18:31:01] with smaller disks [18:31:53] https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=103&orgId=1 [18:32:01] (2TBs instead of 4TBs) [18:32:31] looking too! [18:34:49] I don't see the datanode being hammered from https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-hadoop_cluster=analytics-hadoop&var-worker=an-worker1101 [18:34:53] (datanode panel) [18:34:54] maybe yarn? [18:37:20] partition completely full, but only that one [18:38:46] root@an-worker1101:/var/lib/hadoop/data/s/yarn# du -sh * [18:38:46] 539M local [18:38:46] 331G logs [18:39:00] 331G of logs?? [18:39:16] all from application_1601916545561_83072 [18:39:34] i was just looking at that one too luca [18:39:41] goran job? [18:39:42] https://yarn.wikimedia.org/cluster/app/application_1601916545561_83072 [18:39:45] yeah [18:39:55] it looks like it is doing a big toPandas [18:39:57] no idea what it is doing [18:40:09] which is going to collect the result into a spark master somewhere [18:40:16] but, dunno why that'd make so many logs [18:40:20] https://yarn.wikimedia.org/proxy/application_1601916545561_83072/jobs/job/?id=0 [18:40:29] root@an-worker1101:/var/lib/hadoop/data/s/yarn/logs/application_1601916545561_83072/container_e25_1601916545561_83072_01_000152# du -hs * [18:40:32] 331G stderr [18:41:25] I see a ton of [18:41:25] 20/10/27 18:40:10 DEBUG AvroDeserializer: File schema union could not resolve union.
fileSchema = ["null","bytes"], recordSchema = ["null","string"], datum class = org.apache.avro.util.Utf8: org.apache.avro.UnresolvedUnionException: Not in union ["null","bytes"]: S [18:41:30] org.apache.avro.UnresolvedUnionException: Not in union ["null","bytes"]: S [18:41:33] GoranSM: --^ [18:41:43] online by any chance? :) [18:43:39] elukey: we could kill the app, we could also possibly try manually killing the container process on the node [18:43:41] ottomata: the job seems to be erroring a lot, I am wondering if there is the risk of saturating other partitions [18:43:45] spark might restart it elsewhere... might relieve this smaller disk? [18:43:48] ahahha same time [18:44:10] it is spamming like crazy, I suspect something is wrong with the job [18:44:13] yeah [18:44:26] but it seems to only be this container...i guess? [18:44:32] maybe it is happening elsewhere but we haven't noticed yet [18:44:41] lemme check one elsewhere [18:45:14] hm no [18:45:20] seems to only be this container [18:45:28] elukey: mind if we try killing the one container and see what happens? [18:45:50] ps aux | grep container_e25_1601916545561_83072_01_000152 | grep stderr [18:45:51] nono please go [18:46:11] kill 33770 [18:49:07] elukey: i guess we have to manually remove that stderr file [18:49:20] i'm going to hdfs dfs -put it into /tmp, ok? [18:49:21] and then remove it [18:49:28] in case we need it in the next day or two to look at? [18:50:05] you can also truncate -s 1G or similar, so the file will stay there but the bulk of it will disappear [18:50:12] hmmm ok [18:50:26] (not sure if we remove that stderr if something will get upset, otherwise a rm is fine) [18:50:51] strangely luca at the top of that file i see [18:50:52] org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby.
Visit https://s.apache.org/sbnn-error [18:50:56] a bunhc of that [18:52:26] very strange [18:52:53] ok truncating to 1g [18:52:58] but the buik is the Avro stuff right? [18:53:02] yes [18:53:06] nice [18:53:19] lets see if that task starts up somewhere else and does the same thing... [18:53:29] +1 [18:53:42] RECOVERY - Disk space on Hadoop worker on an-worker1101 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:53:44] we can send an email to Goran with the stacktrace [18:53:54] if it re-happens [18:53:58] k [18:54:08] going to log off for today, ttl! :) [18:54:15] l8rs [18:57:59] 10Quarry: Provide option to download all user queries - https://phabricator.wikimedia.org/T192209 (10zhuyifei1999) a:05zhuyifei1999→03None [19:44:23] nice troubleshooting elukey and ottomata - thanks a lot for caring :) [19:46:45] 10Analytics, 10Research, 10Research-collaborations: Performance Issues when running Spark/Hive jobs via Jupyter Notebooks - https://phabricator.wikimedia.org/T258612 (10Aroraakhil) Thanks for all the help. Yes, the ticket can be closed! :) [19:52:39] (03CR) 10Zhuyifei1999: multiinstance: Attempt to make quarry work with multiinstance replicas (034 comments) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [20:07:58] 10Analytics, 10Analytics-Wikistats, 10Product-Analytics: Mysterious anonymous content page creations on English Wikipedia according to stats.wikimedia.org - https://phabricator.wikimedia.org/T266578 (10JAllemandou) I did a quick check for month 2020-09: ` spark.sql(""" SELECT (caused_by_user_id IS NULL) as... [20:08:08] Gone for tonight [20:17:45] byeeee [21:20:11] 10Analytics: Check home/HDFS leftovers of leila - https://phabricator.wikimedia.org/T264994 (10leila) Thank you! I didn't check and I trust your work here. If something goes deleted unintentionally, I'm sure we can recover. 
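The `truncate -s 1G` cleanup used earlier for the runaway container stderr is worth noting as a pattern: unlike `rm`, truncation frees the space immediately while keeping the path in place, which sidesteps the "will something get upset if we remove that stderr" concern raised in the log. A small demo on a scratch file instead of the real 331G YARN log (assumes GNU coreutils, as on the Debian hosts discussed above):

```shell
# Demo of the in-place truncation cleanup: reclaim space from an oversized
# log file without deleting it.
log=$(mktemp)
dd if=/dev/zero of="$log" bs=1024 count=2048 2>/dev/null  # stand-in for a runaway stderr (2 MiB)
before=$(wc -c < "$log")
truncate -s 4096 "$log"                                   # keep the file, drop the bulk
after=$(wc -c < "$log")
echo "$before -> $after bytes"                            # -> 2097152 -> 4096 bytes
```

On the real host the equivalent would be `truncate -s 1G .../container_.../stderr` after stopping the writer, as was done at 18:52.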
[21:21:08] 10Analytics: Check home/HDFS leftovers of leila - https://phabricator.wikimedia.org/T264994 (10leila) Thank you! I didn't check and I trust your work here. If something goes deleted unintentionally, I'm sure we can recover.
[21:21:08] 10Analytics: Check home/HDFS leftovers of rodolfovalentim - https://phabricator.wikimedia.org/T266467 (10leila) Thanks, elukey. @diego I won't be able to comment here. Please make the call. [22:12:04] helloooo nuria there's a bunch of entropy alarms coming in, is that you? [22:27:27] (03CR) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas (032 comments) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [22:43:41] fdans: looking [22:45:54] fdans: yes, the way the system works the new runs are sending old alarms, i think (as they look for the presence of a file) [22:46:59] fdans: apologies again but now i understand what is going on [22:47:18] nuria: is there a place where we have documented what to do with those alarms? [22:47:40] fdans: yes, one sec. [22:56:40] fdans: ah, no, the system is documented but not the "what to do" [22:56:44] fdans: i can do that [22:56:54] fdans: https://wikitech.wikimedia.org/wiki/Analytics/Data_quality/Traffic_per_city_entropy [22:57:07] fdans: will put a brief blurb on oncall page [22:57:44] nuria: thank you for doing that :)