[05:09:13] 10Analytics: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (10MMiller_WMF) [05:11:26] 10Analytics: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (10MMiller_WMF) @Nuria -- the Growth and Android teams are both currently prototype an image suggestion algorithm with @Miriam in {T256081}. This task would unlock a potential route to increased accuracy in the algorithm.... [06:13:38] good morning [07:02:54] RECOVERY - Hue Gunicorn Python server on an-tool1009 is OK: PROCS OK: 10 processes with args /usr/lib/hue/build/env/bin/python3.7 /usr/lib/hue/build/env/bin/hue rungunicornserver https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue/Administration [07:03:02] RECOVERY - Hue Kerberos keytab renewer on an-tool1009 is OK: PROCS OK: 1 process with args /usr/lib/hue/build/env/bin/python3.7 /usr/lib/hue/build/env/bin/hue kt_renewer https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue/Administration [07:03:23] yessss [07:07:31] it's rare to see people excited about keytabs on a Monday morning :-) [07:09:17] ahahha [07:11:11] bbiab [07:49:27] 10Analytics-Clusters: Upgrade to Superset 0.37.x - https://phabricator.wikimedia.org/T262162 (10elukey) Upstream fixed the two problems and cancelled the current 0.37.2rc1 vote, in theory a new version with fixes should come out soon to get voted/tested again (and possibly released). This will take some days for... [07:58:46] so I am testing another round of superset dashboard, and the feeling is that all of them are requesting a big volume of data [07:58:52] and hence they are very slow to load [07:59:05] in fact I see a lot "Query timeout - visualization queries are set to timeout at 60 seconds" [08:03:29] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics: Superset Updates - https://phabricator.wikimedia.org/T211706 (10elukey) [08:20:47] Morning! Will send reimage 24h warning for 1005 in a bit [08:22:45] klausman: 1006 right? :D [08:22:48] good morning [08:23:10] if possible let's do 48h [08:23:29] to Wed 23rd morning [08:23:31] *so [08:40:59] yes 1006 :) [08:41:07] Ok, Wednesday it is [08:42:32] We can probably schedule also 1007 in the same email, like Friday? [08:42:39] Sounds good. [08:50:32] sent. [08:55:25] Good morning [08:58:55] bonjour! 
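The two Hue RECOVERY notifications above come from an Icinga process check that simply counts processes matching a command-line pattern. Purely as an illustration of what such a check does (this is not the actual check_procs plugin, and the use of psutil here is my own assumption), a minimal sketch:

```python
#!/usr/bin/env python3
"""Sketch of a process-count check like the Hue Gunicorn / kt_renewer alerts.
Not the real Icinga check_procs plugin; psutil is assumed to be installed."""
import sys
import psutil

def count_procs(substring: str) -> int:
    """Count processes whose command line contains the given substring."""
    count = 0
    for proc in psutil.process_iter(attrs=["cmdline"]):
        try:
            cmdline = " ".join(proc.info["cmdline"] or [])
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
        if substring in cmdline:
            count += 1
    return count

if __name__ == "__main__":
    # e.g.: check_procs_sketch.py "hue rungunicornserver" 1
    pattern, minimum = sys.argv[1], int(sys.argv[2])
    found = count_procs(pattern)
    status = "OK" if found >= minimum else "CRITICAL"
    print(f"PROCS {status}: {found} processes with args {pattern}")
    sys.exit(0 if status == "OK" else 2)
```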
[09:00:39] so, hue-next.wikimedia.org is now working and only using CAS [09:00:47] \o/ [09:00:47] no more LDAP auth (native from hue) [09:01:03] and users are auto-created upon first successful login [09:02:08] with current hue.wikimedia.org, we don't have cas in front and we rely on hue to use LDAP for auth [09:02:20] plus we manually sync from ldap for every user that needs to be added [09:02:41] with the new settings, in theory there are two use cases that are worth to discuss [09:03:01] This is super awesome elukey :) [09:03:05] 1) user in the 'nda' ldap group but without any privatedata posix membership accessing hive [09:03:12] 2) user in the 'nda' ldap group but without any privatedata posix membership accessing oozie [09:03:23] (I think it is the worst use case) [09:03:36] in 1), in theory Kerberos should prevent any data leak [09:03:55] since hue acts as proxy, and if the kerberos account is not there, no data fetch [09:04:02] yup [09:04:05] 2) will be fixed via https://phabricator.wikimedia.org/T262660 [09:04:25] but it doesn't seem to be a huge risk atm [09:04:36] (to stop hue's upgrade I mean) [09:07:06] just to confirm, from an-master1001 [09:07:36] INFO FSNamesystem.audit: allowed=true ugi=elukey (auth:PROXY) via hive/an-coord1001.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS) ip=/10.64.21.104 cmd=getfileinfo src=/wmf/data/wmf/projectview/hourly/year=2017/month=1/day=31/hour=7 dst=null perm=null proto=rpc [09:44:27] fighting debian today, so in&out of irc [09:44:38] (new laptop setup) [09:44:45] 10Analytics, 10Analytics-Kanban: Check that mediawiki-events match mediawiki-history changes over a month - https://phabricator.wikimedia.org/T262261 (10JAllemandou) Last news on `revision_create` for `simplewiki` 2020-07: - All kafka-events match mediawiki-history except the ones with `deleted-parts` (mostly... [10:35:20] * elukey lunch! Be back in ~2h [10:48:26] (03CR) 10Joal: "Comments' comments - Code is good :)" (031 comment) [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/628447 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [10:53:05] 10Analytics, 10observability: Indexing errors / malformed logs for aqs on cassandra timeout - https://phabricator.wikimedia.org/T262920 (10JAllemandou) After talking with @elukey we're not sure if there is anything we want to do here. The error happened when Luca roll-restarted the AQS hosts. @Pchelolo : Do yo... [11:17:53] (03PS1) 10Joal: Add page_props and user_properties to sqoopable tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628770 (https://phabricator.wikimedia.org/T258047) [11:56:53] goood morning team <3 [11:56:59] oh damn [11:57:43] Hi fdans :) [11:57:50] welcome! [11:58:21] hellooo joal I missed you!! [11:58:34] So have I fdans :) [11:58:41] :D [12:38:25] (03PS2) 10Joal: Add page_props & user_properties to sqoop/hive/oozie [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628770 (https://phabricator.wikimedia.org/T258047) [12:39:01] 10Analytics, 10Patch-For-Review: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (10JAllemandou) a:03JAllemandou [12:39:54] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (10JAllemandou) [12:41:08] welcome back fdans ! 
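The FSNamesystem.audit line quoted from an-master1001 is the key evidence for case 1): Hue proxies every request through the hive service principal, but HDFS still records the real end user (ugi=elukey, auth:PROXY), so without a Kerberos credential there is no data fetch. A small sketch of pulling the relevant fields out of such an audit line (the regexes below are mine, not anything shipped with Hadoop or Hue):

```python
import re

# The audit line quoted above ([09:07:36]), trimmed to its key=value portion.
AUDIT_LINE = (
    "allowed=true ugi=elukey (auth:PROXY) via "
    "hive/an-coord1001.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS) "
    "ip=/10.64.21.104 cmd=getfileinfo "
    "src=/wmf/data/wmf/projectview/hourly/year=2017/month=1/day=31/hour=7 "
    "dst=null perm=null proto=rpc"
)

def parse_audit(line: str) -> dict:
    """Extract the end user, the proxying principal and the command/path."""
    fields = dict(re.findall(r"(\w+)=(\S+)", line))
    via = re.search(r"via (\S+) \(auth:(\w+)\)", line)
    if via:
        fields["proxy_principal"], fields["proxy_auth"] = via.groups()
    return fields

if __name__ == "__main__":
    parsed = parse_audit(AUDIT_LINE)
    # HDFS records both who asked (ugi) and who proxied the request:
    print(parsed["ugi"], parsed["proxy_principal"], parsed["cmd"], parsed["src"])
```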
[12:41:29] hellooo elukey thank you [12:41:33] it's nice to be back [12:44:28] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Stats for newer projects not available - https://phabricator.wikimedia.org/T258033 (10JAllemandou) Just checked: those projects are now available on labsdb as well as the analytics replica. Adding them to the sqoop-list. [12:47:51] (03PS1) 10Joal: Add 3 new projects to the sqoop list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628791 (https://phabricator.wikimedia.org/T258033) [12:48:32] fdans: just added you as a reviewer as a welcome back ;) [12:48:45] joal: yep! looking [12:51:50] (03CR) 10Fdans: "microcorrection sorry joal" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628791 (https://phabricator.wikimedia.org/T258033) (owner: 10Joal) [12:52:49] (03CR) 10Joal: Add 3 new projects to the sqoop list (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628791 (https://phabricator.wikimedia.org/T258033) (owner: 10Joal) [12:53:31] thanks fdans ) [12:53:32] (03PS2) 10Joal: Add 3 new projects to the sqoop list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628791 (https://phabricator.wikimedia.org/T258033) [12:53:51] (03CR) 10Fdans: [V: 03+2 C: 03+2] "let's go!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628791 (https://phabricator.wikimedia.org/T258033) (owner: 10Joal) [13:11:16] 10Analytics, 10Event-Platform: jsonschema-tools should fail if new required field is added - https://phabricator.wikimedia.org/T263457 (10Ottomata) [13:19:49] (03CR) 10Ottomata: [WIP] Add option to use Wikimedia EventStreamConfig to get kafka topics to ingest (031 comment) [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/628447 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [13:30:39] 10Analytics-Radar, 10Datasets-General-or-Unknown, 10Product-Analytics, 10Structured-Data-Backlog (Current Work): Set up generation of JSON dumps for Wikimedia Commons - https://phabricator.wikimedia.org/T259067 (10Cparle) [13:31:15] 10Analytics-Radar, 10Datasets-General-or-Unknown, 10Product-Analytics, 10Structured-Data-Backlog (Current Work): Set up generation of JSON dumps for Wikimedia Commons - https://phabricator.wikimedia.org/T259067 (10Cparle) a:03Cparle [13:32:37] fdans: as welcome back gift we have a new version of Hue for you :D [13:32:53] hell yea [13:34:14] fdans: yooohooo welcome back [13:35:17] ottomata: helloooooo andrewwww [13:42:47] are you in txs now!? [13:48:17] I can't hear him, so probably [14:19:29] 10Analytics-Clusters, 10Analytics-Kanban: Add more metrics to prometheus-amd-rocm-stats Python script - https://phabricator.wikimedia.org/T262427 (10elukey) p:05Triage→03Medium [14:20:58] 10Analytics-Clusters, 10Analytics-Radar, 10User-Elukey: Monitoring GPU Usage on stat Machines - https://phabricator.wikimedia.org/T251938 (10elukey) 05Open→03Resolved @Aroraakhil yep it should be fine, during the next releases they changed the script to use less privileges, so all good! Thanks for the pa... [14:22:15] heya teammm [14:22:23] welcome back fdans! [14:22:28] 10Analytics-Clusters, 10Jupyter-Hub: Timeout during relaunch Jupyterhub server - https://phabricator.wikimedia.org/T258087 (10elukey) 05Open→03Invalid Please re-open if necessary :) [14:38:13] yay, fdans is back! 
Welcome welcome, hope you got to chill a bit [14:38:51] As far as I am informed, Texas tends to *not* be particular chill in September :) [14:45:54] So much fighting ancient Python programs today :-S My brain is mush [14:49:13] klausman: you would be surprised! pretty rainy and kinda chilly today [14:49:19] hellooo milimetric and mforns [14:50:26] 10Analytics, 10Event-Platform: EventGate idea: use presence of schema properties in http.(request|response)_headers to automatically set headers values in event data - https://phabricator.wikimedia.org/T263466 (10Ottomata) [14:50:41] the ancients take a toll, I'm gonna go hang out in the cave early if anyone wants to socialize [14:53:00] 10Analytics, 10Event-Platform: EventGate idea: use presence of schema properties in http.(request|response)_headers to automatically set headers values in event data - https://phabricator.wikimedia.org/T263466 (10Ottomata) [14:54:24] ottomata: o/ if you are ok I'd deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/628850 tomorrow morning (EU quiet time) [14:54:57] 10Analytics, 10Event-Platform, 10Product-Infrastructure-Data: EventGate idea: use presence of schema properties in http.(request|response)_headers to automatically set headers values in event data - https://phabricator.wikimedia.org/T263466 (10Ottomata) [14:54:59] _1 [14:55:01] +1 elukey [14:55:19] 10Analytics, 10Event-Platform, 10Product-Infrastructure-Data: EventGate idea: use presence of schema properties in http.(request|response)_headers to automatically set headers values in event data - https://phabricator.wikimedia.org/T263466 (10jlinehan) [14:55:20] works fine in hadoop test afaics, if it works good also for presto I'll do the same for hadoop the next day [14:55:35] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (10Nuria) @MMiller_WMF change is on the works, will be effective with the next mediawiki snapshot [15:01:32] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Fix TLS certificate location and expire for Hadoop/Presto/etc.. and add alarms on TLS cert expiry - https://phabricator.wikimedia.org/T253957 (10elukey) [15:10:59] how do I get to the oozie running job interface in hue-next [15:11:19] there is a oozie button on the left column [15:11:49] and then you need to check the filter since it automatically lists only the jobs running with you username only [15:12:56] (03PS1) 10Joal: Add 3 new projects to the sqoop list - bis [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628854 (https://phabricator.wikimedia.org/T258033) [15:13:26] elukey: that only seems to be an oozie editro? [15:13:33] fdans: --^ just as you're so fast ;) [15:14:43] OHHH jobs [15:14:44] got it [15:14:52] perfect this is great [15:15:38] (03CR) 10Fdans: [V: 03+2 C: 03+2] "niiice" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628854 (https://phabricator.wikimedia.org/T258033) (owner: 10Joal) [15:19:02] elukey: here's a weird hue-next thing [15:19:20] when I click on 'Configuration' for an oozie job [15:19:35] i get sent to the 'old' hue ui, which errors, and then redirects me back to the hive editro [15:19:43] editor* [15:19:44] oh my [15:19:52] do you have a link? 
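The EventGate idea filed above (T263466) boils down to: if an event's schema declares properties under http.request_headers, the service could copy the matching HTTP request headers into the event automatically. EventGate itself is written in Node.js, so the following Python sketch only illustrates the logic; the function name and the example schema/stream are made up:

```python
"""Illustration of the T263466 idea: auto-populate http.request_headers fields
that the event's schema declares. Sketch only; EventGate is a Node.js service."""

def augment_event_with_headers(event: dict, schema: dict, request_headers: dict) -> dict:
    declared = (
        schema.get("properties", {})
        .get("http", {})
        .get("properties", {})
        .get("request_headers", {})
        .get("properties", {})
    )
    if not declared:
        return event
    lowered = {k.lower(): v for k, v in request_headers.items()}  # header names are case-insensitive
    headers_field = event.setdefault("http", {}).setdefault("request_headers", {})
    for header_name in declared:
        value = lowered.get(header_name.lower())
        if value is not None and header_name not in headers_field:
            headers_field[header_name] = value
    return event

if __name__ == "__main__":
    schema = {"properties": {"http": {"properties": {"request_headers": {
        "properties": {"user-agent": {"type": "string"}}}}}}}
    event = {"meta": {"stream": "test.event"}}
    augment_event_with_headers(event, schema, {"User-Agent": "MyApp/1.0"})
    print(event["http"]["request_headers"])  # {'user-agent': 'MyApp/1.0'}
```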
[15:20:03] 10Analytics, 10Event-Platform, 10Privacy Engineering, 10Product-Infrastructure-Data, and 3 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (10jlinehan) p:05High→03Low [15:32:32] elukey: https://hue-next.wikimedia.org/hue/jobbrowser/#!id=0001511-200915132022208-oozie-oozi-C [15:32:35] and then Confniguration in upper rigth [15:33:14] yeah, I see only a failure in loading the page (so I'll investigate), but not the fallback to hue 3's ui [15:42:57] !log manually killing wikidata-json_entity-weekly-wf-2020-08-31 - Raw data is missing from dumps folder (json dumps) [15:42:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:45:00] !log Restart wikidata-json_entity-weekly coordinator after wrong kill in new hue UI [15:45:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:46:12] 10Analytics-Radar, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): PoC on anomaly detection with Flink - https://phabricator.wikimedia.org/T262942 (10Zbyszko) [15:49:05] Found a problem with hue-next elukey :)( [15:50:49] joal: yep I know there would be problems, please let me know how to repro :) [15:52:01] 10Analytics, 10Event-Platform, 10Product-Infrastructure-Data: EventGate idea: use presence of schema properties in http.(request|response)_headers to automatically set header values in event data - https://phabricator.wikimedia.org/T263466 (10Ottomata) [16:02:03] ottomata: https://github.com/cloudera/hue/issues/1270 [16:02:17] hopefull the fix is easy :) [16:03:45] ping ottomata elukey , looks like klausman is not invited to larger SRE meeting [16:03:51] ah snap [16:04:24] klausman: you should have an email now [16:05:01] perfect I see you :) [16:18:18] 10Analytics, 10Growth-Team, 10Product-Analytics: Revisions missing from mediawiki_revision_create - https://phabricator.wikimedia.org/T215001 (10Nuria) We should understand the reasons that lead to these events to be lost , 1% seems a lot of events to be missing. Pinging @WDoranWMF about this cause we could... [16:21:33] !log Kill restart wikidata-item_page_link-weekly-coord to not wait on missing data [16:21:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:30:55] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10Nuria) a:05fdans→03mforns [16:33:56] 10Analytics, 10Event-Platform, 10Product-Analytics (Kanban): Product Analytics to review & provide feedback for Event Platform Instrumentation How-To - https://phabricator.wikimedia.org/T253269 (10LGoto) [16:34:05] 10Analytics, 10Event-Platform, 10Product-Analytics (Kanban): Product Analytics to review & provide feedback for Event Platform Instrumentation How-To - https://phabricator.wikimedia.org/T253269 (10mpopov) [16:34:35] 10Analytics, 10Event-Platform, 10Product-Analytics (Kanban): Product Analytics to review & provide feedback for Event Platform Instrumentation How-To - https://phabricator.wikimedia.org/T253269 (10mpopov) [16:35:15] 10Analytics, 10Event-Platform, 10Product-Analytics (Kanban): Product Analytics to review & provide feedback for Event Platform Instrumentation How-To - https://phabricator.wikimedia.org/T253269 (10mpopov) 05Open→03Resolved Jason & I reviewed the documentation and have updated it where needed. Marking as... 
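On T215001 ("1% seems a lot of events to be missing"), the kind of cross-check described in T262261 amounts to comparing revision ids seen in the revision-create stream against mediawiki_history for the same wiki and month. A rough PySpark sketch; the table and column names follow the usual Data Lake conventions but should be treated as assumptions, and the real comparison also accounts for things like deleted revisions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("revision-create-vs-history").getOrCreate()

# Revision ids seen as events (assumed table/partition layout).
events = spark.sql("""
    SELECT DISTINCT rev_id
    FROM event.mediawiki_revision_create
    WHERE `database` = 'simplewiki'
      AND year = 2020 AND month = 7
""")

# Revision ids recorded by mediawiki_history for the same wiki and month.
history = spark.sql("""
    SELECT DISTINCT revision_id AS rev_id
    FROM wmf.mediawiki_history
    WHERE snapshot = '2020-08'
      AND wiki_db = 'simplewiki'
      AND event_entity = 'revision'
      AND event_type = 'create'
      AND event_timestamp LIKE '2020-07%'
""")

missing = history.join(events, on="rev_id", how="left_anti")
print("revisions in history but absent from the event stream:", missing.count())
```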
[16:35:19] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Product-Analytics: Write and update Event Platform instrumentation documentation for Product teams - https://phabricator.wikimedia.org/T233329 (10mpopov) [16:36:18] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Team-Backlog, 10Epic: Event Platform Client Libraries - https://phabricator.wikimedia.org/T228175 (10mpopov) [16:36:47] 10Analytics-Radar, 10Growth-Team (Current Sprint), 10Product-Analytics (Kanban): Newcomer tasks: update schema whitelist for Guidance - https://phabricator.wikimedia.org/T255501 (10nettrom_WMF) 05Open→03Resolved I've verified that this change has been deployed. The NewcomerTask schema is available in a s... [16:39:29] 10Analytics-Radar, 10Product-Analytics, 10Product-Infrastructure-Team-Backlog, 10Epic: Re-define what constitutes a mobile pageview - https://phabricator.wikimedia.org/T257277 (10LGoto) [16:42:49] 10Analytics-Radar, 10Product-Analytics, 10Product-Infrastructure-Data, 10Wikipedia-Android-App-Backlog, and 2 others: [EPIC] Count unique iOS & Android users precisely and in a privacy conscious manner that does not require opt in to send data - https://phabricator.wikimedia.org/T202664 (10kzimmerman) [16:46:27] 10Analytics-Radar, 10Product-Analytics, 10Product-Infrastructure-Data, 10Wikipedia-Android-App-Backlog, and 2 others: [EPIC] Count unique iOS & Android users precisely and in a privacy conscious manner that does not require opt in to send data - https://phabricator.wikimedia.org/T202664 (10LGoto) a:05SNow... [16:51:18] 10Analytics-Radar, 10Product-Analytics: Collect metrics/tables which might be touched by IP masking feature - https://phabricator.wikimedia.org/T255816 (10jwang) 05Open→03Resolved The info collection is done. So close it. [16:59:53] !log Manually add _SUCCESS file to events to hourly-partition of page_move events so that wikidata-item_page_link job starts [16:59:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:00:14] milimetric: it happened faster than I would have said --^ ;) [17:01:28] joal: on the CR for page_props [17:01:41] joal: isn't there also a step about "updating partitions" [17:01:50] joal: for each table we import? [17:01:54] milimetric: since now our timezones almost completely overlap we can do pairing on wikistats anytime you want, if you feel like it :) [17:02:09] ack nuria - no emergency - the update partition as well as table creation are all bundled in the patch [17:02:20] ottomata: I manually fixed the Configuration panel on an-tool1009, let me know if it works now.. I'll have to send a pull request to upstream [17:02:21] joal: k [17:03:26] it works elukey! [17:03:31] elukey: i don't remember if this worked before or not [17:03:37] but the Logs tab for oozie jobs doesn't show anything [17:03:43] it might not have previously either... [17:04:57] hmm it kinda works on some workflows [17:05:08] anyway, just FYI, i don't really rely on that anyway [17:07:19] PROBLEM - Disk space on Hadoop worker on an-worker1084 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [17:07:52] (03CR) 10Joal: "Sqoop job tested, table creation as well. Oozie-load update tested." 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/628770 (https://phabricator.wikimedia.org/T258047) (owner: 10Joal) [17:10:44] mmmmmm no bueno the disk alert [17:11:25] elukey: we're computing a new text snapshot :( [17:11:36] so from https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=103&orgId=1 it seems that only one is getting bad [17:11:44] that is strange elukey [17:11:46] looking into [17:12:33] we are again around the 2PB mark https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=25&orgId=1&from=now-7d&to=now [17:12:53] so I am pretty sure that other workers will follow if we keep adding data [17:13:26] it is very strange that only one disk on this node is so full [17:14:56] maybe that is a specific yarn container using it [17:14:57] elukey what is /wmf/data/learning? 22T [17:14:58] do you know? [17:15:22] ottomata: rollup data on automated traffic, but user-signature [17:15:32] s/but/by/ [17:15:56] learning?? [17:16:39] ottomata: yes [17:16:39] HMMMM [17:16:42] ottomata: as in "ML" [17:16:47] ottomata: cause it is really features [17:16:51] it looks like we are not purging wmf raw mediawiki_job [17:16:52] ottomata: for ML [17:16:53] oh ML [17:16:54] haha [17:17:14] (...not very clear but oook! :p ) [17:17:27] razzi: are you around? [17:17:33] Yup [17:18:04] so just to give you some intro - the above alarm is about one single partition, out of the 12 that we have for hdfs data, getting full [17:18:09] on a single worker node [17:18:52] usually when this happens it might mean that the HDFS file system is getting bigger as well, or that a specific app is causing problems [17:18:53] joal: i see both wikidata and wikibase dirs, is wikibase the wikidata dump import? [17:19:09] razzi: in https://grafana.wikimedia.org/d/000000585/hadoop you have a good overview of all the daemons [17:19:11] (03CR) 10Nuria: [C: 03+2] "Since workflow is tested let's please merge." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628770 (https://phabricator.wikimedia.org/T258047) (owner: 10Joal) [17:19:29] razzi: under Namenode metrics (the hadoop hdfs master) you have some info about total HDFS space etc.. [17:19:49] raw/wikibase has 4 snapshots [17:19:55] snapshot=2019-11 snapshot=2019-12-20 snapshot=2020-01-06 snapshot=2020-01-28 [17:20:12] and uses a lot of space [17:20:12] 57.1 G 171.3 G /wmf/data/raw/wikibase [17:20:18] OH [17:20:19] sorry [17:20:20] nevermind [17:20:21] that is G [17:20:24] not T [17:20:33] conflated one line with another [17:21:00] ottomata: raw/wikidata are dumps, raw/wikibase are sqooped tables [17:21:39] ottomata: I'm gonna drop some data, I have a few TB in my home folder [17:22:58] setting up purge job for mediawiki_job raw data [17:23:03] 18.5 T 55.6 T /wmf/data/raw/mediawiki_job [17:23:17] ottomata: kept indefinitely? [17:23:18] we also need to drop netflow raw data [17:23:57] oh, i'll add that too [17:24:15] we should probably have these purge jobs declared as part of the camus_job wrapper...
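The paired numbers pasted above (e.g. `18.5 T 55.6 T /wmf/data/raw/mediawiki_job`) are `hdfs dfs -du` output: logical size first, then the space actually consumed across all replicas, which is why everything shows up at roughly 3x. A small sketch that re-derives the replication factor from the raw byte counts; it assumes a working hdfs client and a valid Kerberos ticket on an analytics host:

```python
import subprocess

def du_with_replication(path: str) -> None:
    """Print each child of an HDFS path with its logical size and the
    effective replication factor (consumed bytes / logical bytes)."""
    out = subprocess.run(
        ["hdfs", "dfs", "-du", path],   # plain bytes, not -h, so we can do arithmetic
        check=True, capture_output=True, text=True,
    ).stdout
    for line in out.splitlines():
        size, consumed, child = line.split(None, 2)
        size, consumed = int(size), int(consumed)
        factor = consumed / size if size else 0.0
        print(f"{child}\t{size / 1024**4:.1f} TiB\tx{factor:.1f}")

if __name__ == "__main__":
    du_with_replication("/wmf/data/raw")
```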
[17:24:32] joal: i see job events from 2019 [17:25:45] 10Analytics-Radar, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): PoC on anomaly detection with Flink - https://phabricator.wikimedia.org/T262942 (10CBogen) [17:26:30] RECOVERY - Disk space on Hadoop worker on an-worker1084 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [17:26:33] nuria: from your CR on page_props: workflow is the thing that has not been tested, right [17:26:43] good [17:28:02] /dev/sdm1 3.6T 3.3T 320G 92% /var/lib/hadoop/data/m [17:28:05] /dev/sdg1 3.6T 3.6T 36G 100% /var/lib/hadoop/data/g [17:28:08] situation is not really great yet [17:28:15] :( [17:28:37] also on the same node most of the partitions are around 92/94% of usage [17:28:47] so I think it is more a hdfs global usage [17:29:00] right [17:29:17] elukey: do we have rack equity? [17:29:25] elukey:, ottomata : netflow data is not dropped because it is not ingested as part of the regular events [17:29:29] joal: what do you mean? [17:29:41] elukey:, ottomata : adding a timer to drop it should be easy [17:29:45] ya just like mediawiki_job [17:29:46] elukey: cause if we don't and hadoop tries to optimize blocks over racks, some might be busier than others [17:29:47] i'm doing that now [17:30:13] joal: we have not a perfect split between rows but a reasonable one [17:30:13] waiting for refinery-drop-older-than to give me a checksum for mediawiki_job... [17:30:29] ottomata, elukey : https://phabricator.wikimedia.org/T231339 [17:30:51] 10Analytics, 10Analytics-Kanban: Set up automatic deletion/snatization for netflow data set in Hive - https://phabricator.wikimedia.org/T231339 (10Nuria) [17:30:59] 10Analytics, 10Analytics-Kanban: Set up automatic deletion/snatization for netflow data set in Hive - https://phabricator.wikimedia.org/T231339 (10Nuria) moving to kanban [17:31:14] nuria: yes I brought it up some time ago but we wanted to try the eventgate onboarding first, that didn't proceed due to some things to solve etc.. [17:31:17] elukey: i'm looking at the [17:31:19] /wmf/data/raw/netflow/netflow/hourl [17:31:20] data [17:31:26] ack! [17:31:27] not /wmf/data/wmf/netflow [17:31:33] it should be ok to purge the raw stuff, right? [17:31:44] in theory yes, we don't really need to do any recompute [17:31:49] k [17:34:51] elukey@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -du -h /var/log/hadoop-yarn [17:34:54] 51.3 T 153.8 T /var/log/hadoop-yarn/apps [17:34:59] hello hello [17:36:10] i wonder if we can set up a purge job for that too [17:36:35] I think that joal found some days ago that we keep yarn logs for 90d, and we forgot to follow up [17:36:43] should we decrease it to say 30d? [17:37:46] we have in puppet $yarn_log_aggregation_retain_seconds = 7776000, [17:37:57] elukey: I think it makes sense to do so - As per klausman saying that keeping some more to compare regular stuff, I'd say 40d (have 2 beginning month for some time) [17:38:06] +1 [17:38:30] we gotta remember that more hadoop usage not only means more compute usage...BUT MORE HDFS USAGE FROM LOG DATA! [17:38:33] so 3456000 ?
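For the record, the retention arithmetic being discussed does line up:

```python
DAY = 24 * 60 * 60                 # 86400 seconds
assert 90 * DAY == 7_776_000       # current yarn_log_aggregation_retain_seconds (90 days)
assert 40 * DAY == 3_456_000       # the proposed ~40 day value
print(40 * DAY)                    # 3456000
```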
[17:38:44] sending a patch [17:40:07] https://gerrit.wikimedia.org/r/628887 [17:40:37] razzi: please feel free to ask questions :) [17:41:07] at this point what we are trying to do is lower the retention of yarn logs (any job sent to the cluster stores some) [17:41:56] the two values above, 51.3 T 153.8 T /var/log/hadoop-yarn/apps, are related to the size of the data (51TB) vs its 'real' replicated value in the cluster (3 times since we have 3 replicas for each block) [17:42:46] joal,ottomata - ok if I go with https://gerrit.wikimedia.org/r/c/operations/puppet/+/628887 ? [17:42:57] I guess that only resource managers need to be restarted [17:43:08] +1 [17:43:42] it checks once a day so the clean up will not be immediate [17:44:19] !log restart yarn resource managers on an-master100[1,2] to pick up settings for https://gerrit.wikimedia.org/r/c/operations/puppet/+/628887 [17:44:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:44:39] mforns: does refinery-drop-older-than have to finish evaluating all the dry-run would-have-dropped dirs to give me the checksum? [17:44:53] it seems to be hanging...been waiting for a checksum for 10-15 minutes [17:46:25] ottomata: yes [17:46:27] one sec [17:47:26] (RM on 1001 restarted) [17:47:59] ottomata: are you trying to purge several data sets at once? [17:48:05] no [17:48:08] well [17:48:14] yes? everything in /wmf/data/raw/mediawiki_job [17:48:16] ottomata: or else lots of partitions? [17:48:20] yeah probably a lot [17:49:10] ottomata: yes, all in /wmf/data/raw/mediawiki_job is too much [17:49:31] that is a problem in the script, that I thought wouldn't be so annoying, but it is... [17:49:47] the thing is that, in production, when actually deleting the data, that won' [17:49:53] won't happen [17:50:05] but the dry run works like this [17:50:13] we can fix that, with a bit of code [17:50:39] ya [17:50:40] ? [17:50:55] yes, but would take a bit [17:50:57] (1002 restarted) [17:51:12] ottomata: when's the deadline for what you're deleting? [17:52:23] 90 days [17:52:37] mforns: [17:52:39] command is [17:52:40] refinery-drop-older-than --base-path='/wmf/data/raw/mediawiki_job' --path-format='.+/hourly/(?P<year>[0-9]+)(/(?P<month>[0-9]+)(/(?P<day>[0-9]+)(/(?P<hour>[0-9]+))?)?)?' --older-than='90' --skip-trash [17:53:46] ottomata: ok, but I meant, would it be OK if you wait 1 day to setup that deletion? [17:53:57] oh hm i think so, maybe? elukey [17:54:00] this way I could look at the script today [17:54:08] if we have to wait the yarn log deletion should hold us over, right? [17:54:25] i could also just manually delete anything older than 2020 [17:54:26] that would be easy [17:54:37] yeah I think it is fine [17:56:13] ottomata: ok, then I will look into those changes, hopefully tomorrow I have sth [17:58:09] thank you mforns ! [17:58:23] np [18:00:12] !log execute sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/mgerlach/logs/* to free ~30TB of space on HDFS (Replicated) [18:00:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:00:22] I asked beforehand for --^ [18:01:22] +1 from me elukey [18:01:30] oh sorry you asked mgerl, got it [18:01:32] got it [18:01:45] yep! https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=103&orgId=1&from=now-3h&to=now looks better! [18:13:16] ottomata: ok if I log off? Anything that I can help with? I saw your patch about raw data drops, it looks good
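The --path-format regex in that command turns each partition directory into a timestamp via Python named groups, which is how the script decides what is older than 90 days. A standalone sketch of that interpretation (it mirrors the idea, not refinery-drop-older-than's actual code, and the example path is made up):

```python
import re
from datetime import datetime

PATH_FORMAT = (
    r".+/hourly/(?P<year>[0-9]+)"
    r"(/(?P<month>[0-9]+)(/(?P<day>[0-9]+)(/(?P<hour>[0-9]+))?)?)?"
)

def partition_datetime(path: str) -> datetime:
    """Map an hourly partition path to the datetime it represents."""
    groups = re.match(PATH_FORMAT, path).groupdict()
    parts = {k: int(v) for k, v in groups.items() if v is not None}
    return datetime(parts["year"], parts.get("month", 1),
                    parts.get("day", 1), parts.get("hour", 0))

if __name__ == "__main__":
    # Hypothetical raw partition path:
    dt = partition_datetime("/wmf/data/raw/mediawiki_job/someJobTopic/hourly/2019/06/30/23")
    print(dt, (datetime(2020, 9, 21) - dt).days > 90)  # 2019-06-30 23:00:00 True
```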
[18:13:20] thanks a lot [18:13:48] elukey: looks good [18:13:50] thank you [18:13:52] l8rs! [18:14:33] o/ [18:18:57] 10Analytics, 10Platform Engineering, 10Epic, 10Platform Team Initiatives (API Gateway): AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10Pchelolo) [18:29:37] ottomata: you know if the deletion script was at partition-deletion time or at directory-deletion time? Did it already log all the partitions to delete? [18:32:00] I mean, if you know at what point did it get stuck [18:33:20] 10Analytics, 10Platform Engineering, 10Epic, 10Platform Team Initiatives (API Gateway): AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10Pchelolo) [18:35:54] mforns: i think it logged all the directories [18:35:59] there aren't any hive tables mapped onto this data [18:36:15] ah ok [18:37:45] milimetric: remember I promised to write up about AQS. I wrote up about AQS. T263489 would like to know your opinion on the technical part of it [18:37:45] T263489: AQS 2.0 - https://phabricator.wikimedia.org/T263489 [18:41:20] fdans: I’m very happy about that, sorry just got out of meeting, eating some and taking a breath, but we have forever to make sweet code together :) [18:44:16] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Stats for newer projects not available - https://phabricator.wikimedia.org/T258033 (10The_Discoverer) >>! In T258033#6479651, @gerritbot wrote: > Change 628791 had a related patch set uploaded (by Joal; owner: Joal): > [analytics/refinery@master] Add 3... [18:52:45] milimetric: yessss [19:06:03] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10egardner) a:03egardner [19:20:50] (03PS1) 10Joal: Add new projects to sqoop list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628917 [19:21:10] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Stats for newer projects not available - https://phabricator.wikimedia.org/T258033 (10JAllemandou) @The_Discoverer - Done :) [19:24:44] (03CR) 10Joal: [V: 03+2] "Merging after +2" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628770 (https://phabricator.wikimedia.org/T258047) (owner: 10Joal) [19:37:18] Ok I'm done for tonight team - see you tomorrow [19:46:18] ottomata: https://github.com/wikimedia/jsonschema-tools/issues/21 -- your answer may be that I'm doing it wrong :) [20:09:55] oo thanks bd808 commented [20:09:55] def a bug. [20:10:05] Q! what are you using it for, just trying it out?
[20:17:29] 10Analytics: Improve discovery of paths to delete in refinery-drop-older-than - https://phabricator.wikimedia.org/T263495 (10mforns) [20:17:38] 10Analytics, 10Analytics-Kanban: Improve discovery of paths to delete in refinery-drop-older-than - https://phabricator.wikimedia.org/T263495 (10mforns) [20:19:16] (03PS1) 10Mforns: Improve path discovery in drop-older-than [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628933 (https://phabricator.wikimedia.org/T263495) [20:21:02] 10Analytics, 10Operations: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10CDanis) [20:22:38] ottomata: when you have a minute I'd love some thoughts re: T263496 [20:22:39] T263496: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 [20:23:14] cdanis: i really want to click on that link adn comment but i have to do some practice run throughs for my talk tomorrow and i am way unprepared! :) [20:23:18] i will try to get to it after that [20:23:20] ah okay [20:23:25] gonna sign off IRC for now though! [20:23:26] :D [20:23:26] no worries :) [20:23:28] ttyl! [20:23:43] ottomata: I'm working on a new project called Toolhub. It will be using json schema files to validate content that it crawls from submitted URLs. More info at https://meta.wikimedia.org/wiki/Toolhub if you are interested. [20:43:58] bd808: you just missed him :) [20:51:49] 10Analytics, 10Analytics-Kanban: Check that mediawiki-events match mediawiki-history changes over a month - https://phabricator.wikimedia.org/T262261 (10Nuria) [20:51:52] 10Analytics, 10Growth-Team, 10Product-Analytics: Revisions missing from mediawiki_revision_create - https://phabricator.wikimedia.org/T215001 (10Nuria) [21:04:22] 10Analytics-Radar, 10Technical-blog-posts: Story idea for Blog: The Best Dataset on Wikimedia Content and Contributors - https://phabricator.wikimedia.org/T259559 (10srodlund) @Millimetric Great! Since we have 2 posts going up this week, I am going to publish this one early next week if that works for you. [21:07:00] (03CR) 10Nuria: [C: 03+2] Add new projects to sqoop list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628917 (owner: 10Joal) [21:11:32] (03CR) 10Nuria: "adding razzi to review" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628933 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns) [21:12:42] (03CR) 10Mforns: [C: 04-2] "Still testing! Please, do not merge :]" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628933 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns) [21:12:43] mforns: yt? [21:12:57] nuria: yes [21:13:01] mforns: I am not sure i understand the problem with https://gerrit.wikimedia.org/r/c/analytics/refinery/+/628933/1/bin/refinery-drop-older-than [21:13:14] cc razzi (who i added to CR) [21:13:30] yes, I know my explanation is weak, but it's kinda tough to explain... [21:13:38] mforns: shouldn't the hdfs.ls(directories_to_analyze, include_children=False) [21:13:48] be hdfs.ls(directories_to_analyze, include_children=True)? [21:13:50] the problem with the script is in dry-run time [21:14:21] nuria: no, include_children=True would navigate all directory tree, and could be too much [21:14:46] if we are, say, ls-ing the root of a 80k directory directory-tree [21:16:25] so, the problem was: the way the script was ls-ing: with globs like /base/path/*/*/*/* [21:18:08] mforns: but the problem is that an 80, 000 list is returned ? 
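On T263496 (augmenting NEL reports with GeoIP country code and AS number): the lookup side of that is straightforward with the MaxMind databases. A sketch using the stock geoip2 reader API; the database paths are assumptions and the real NEL pipeline may do the enrichment elsewhere (e.g. at ingestion time):

```python
import geoip2.database

# Paths are assumptions; adjust to wherever the MaxMind databases live.
COUNTRY_DB = "/usr/share/GeoIP/GeoLite2-Country.mmdb"
ASN_DB = "/usr/share/GeoIP/GeoLite2-ASN.mmdb"

def geo_as_lookup(ip: str) -> dict:
    """Return ISO country code and AS number/organization for a client IP."""
    with geoip2.database.Reader(COUNTRY_DB) as countries, \
         geoip2.database.Reader(ASN_DB) as asns:
        country = countries.country(ip)
        asn = asns.asn(ip)
        return {
            "country_code": country.country.iso_code,
            "as_number": asn.autonomous_system_number,
            "as_org": asn.autonomous_system_organization,
        }

if __name__ == "__main__":
    # Replace with a real client IP; private/reserved ranges raise AddressNotFoundError.
    print(geo_as_lookup("198.51.100.1"))
```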
[21:18:16] mforns: or the recursion over the list? [21:18:26] nuria: no, hdfs can not take it [21:18:48] mforns: i see hdfs ls will itself fail [21:18:54] yes [21:19:35] mforns: ok but your fix does not fix that right? ( I might be totally off) [21:22:25] mforns: * it seems* that if the argument passed to teh script makes hdfs.ls fails that is OK, we catch that and say oops [21:23:05] hm, but the ls is needed to select all paths that need to be deleted [21:23:35] mforns: i think i am missing something [21:23:49] wanna bc? [21:23:54] ya [21:24:01] cc razzi in case he wants to join [21:24:14] yea! [21:24:34] mforns: bc [21:24:37] razzi: yt? [21:25:17] Yeah, joining [21:59:54] 10Analytics-Data-Quality, 10Analytics-EventLogging, 10Analytics-Radar, 10Product-Analytics, and 3 others: WikiEditor records all edits as platform = desktop in EventLogging - https://phabricator.wikimedia.org/T249944 (10kaldari) [22:16:00] 10Analytics, 10Analytics-EventLogging, 10Product-Analytics, 10Documentation: Document how ad blockers / tracking blockers interact with EventLogging - https://phabricator.wikimedia.org/T263503 (10kaldari) [22:19:45] 10Analytics-Data-Quality, 10Analytics-EventLogging, 10Analytics-Radar, 10Product-Analytics, and 3 others: WikiEditor records all edits as platform = desktop in EventLogging - https://phabricator.wikimedia.org/T249944 (10Jdlrobson) > Someone is using a phone or tablet to edit with the Wikitext editor on the... [22:24:30] 10Analytics, 10Analytics-EventLogging, 10Product-Analytics, 10Documentation: Document how ad blockers / tracking blockers interact with EventLogging - https://phabricator.wikimedia.org/T263503 (10kaldari) [22:27:04] 10Analytics, 10Analytics-EventLogging, 10Product-Analytics, 10Documentation: Document how ad blockers / tracking blockers interact with EventLogging - https://phabricator.wikimedia.org/T263503 (10kaldari) [22:27:43] 10Analytics, 10Analytics-EventLogging, 10Product-Analytics, 10Documentation: Document how ad blockers / tracking blockers interact with EventLogging - https://phabricator.wikimedia.org/T263503 (10kaldari) [22:35:10] Signing off for the day, see y'all tomorrow [22:36:44] 10Analytics-Data-Quality, 10Analytics-EventLogging, 10Analytics-Radar, 10Product-Analytics, and 3 others: WikiEditor records all edits as platform = desktop in EventLogging - https://phabricator.wikimedia.org/T249944 (10kaldari) @Jdlrobson - This task is about EventLogging via the EditAttemptStep schema, n... [22:37:17] 10Analytics-Data-Quality, 10Analytics-EventLogging, 10Analytics-Radar, 10Product-Analytics, and 3 others: WikiEditor records all edits as platform = desktop in EventLogging - https://phabricator.wikimedia.org/T249944 (10kaldari) [23:01:23] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10egardner) I'm about to start working on adding the instrumentation for t... [23:46:02] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10Nuria) @egardner probably a quick meeting with @nettrom_WMF or @jlineha... 
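The diagnosis above is that the dry run asks HDFS to expand one huge glob like /base/path/*/*/*/*, which falls over around 80k directories. One way to avoid that kind of blow-up (a sketch of the general idea only, not the actual patch in Gerrit 628933) is to list one level at a time and prune whole year/month subtrees that are entirely older than the cutoff:

```python
import subprocess
from datetime import datetime, timedelta

def hdfs_ls(path: str) -> list:
    """List the immediate children of an HDFS directory (one level only)."""
    out = subprocess.run(["hdfs", "dfs", "-ls", path],
                         capture_output=True, text=True, check=True).stdout
    return [line.split()[-1] for line in out.splitlines() if line.startswith(("d", "-"))]

def old_paths(base: str, cutoff: datetime, depth: int = 4) -> list:
    """Return paths under <base>/<year>/<month>/<day>/<hour> that are entirely
    older than cutoff. Whole years/months are returned as-is instead of being
    expanded, so HDFS is never asked to list tens of thousands of leaves."""
    limits = (cutoff.year, cutoff.month, cutoff.day, cutoff.hour)
    old = []

    def walk(path: str, level: int) -> None:
        for child in hdfs_ls(path):
            name = child.rstrip("/").rsplit("/", 1)[-1]
            if not name.isdigit():
                continue
            value = int(name)
            if value < limits[level]:
                # All ancestors matched the cutoff prefix exactly (we only
                # recurse on equality), so this whole subtree is older.
                old.append(child)
            elif value == limits[level] and level + 1 < depth:
                walk(child, level + 1)
            # value > limits[level]: entirely newer, ignore.

    walk(base, 0)
    return old

if __name__ == "__main__":
    cutoff = datetime.utcnow() - timedelta(days=90)
    # Hypothetical per-topic raw directory:
    for path in old_paths("/wmf/data/raw/mediawiki_job/someJobTopic/hourly", cutoff):
        print("would drop:", path)
```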
[23:50:13] 10Analytics-Data-Quality, 10Analytics-EventLogging, 10Analytics-Radar, 10Product-Analytics, and 3 others: WikiEditor records all edits as platform = desktop in EventLogging - https://phabricator.wikimedia.org/T249944 (10Jdlrobson) The EditAttemptSchema hardcodes `phone` for the mobile editor - https://gerr...