[00:39:49] Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: Add dimensions to editors_daily dataset - https://phabricator.wikimedia.org/T256050 (cchen) @JAllemandou @Milimetric thanks for confirming this with me. I confirm that PMs are aware the data related to platforms are non-additive an...
[01:40:05] Analytics: Categorylinks dump might have some problem with the encoding - https://phabricator.wikimedia.org/T264850 (Milimetric) @ArielGlenn is this something you'd know about or know who to point me to?
[03:01:13] Analytics, Dumps-Generation, Wikidata, Wikidata-Query-Service: Categorylinks dump might have some problem with the encoding - https://phabricator.wikimedia.org/T264850 (ArielGlenn)
[03:05:45] Analytics, Dumps-Generation, Wikidata, Wikidata-Query-Service: Categorylinks dump might have some problem with the encoding - https://phabricator.wikimedia.org/T264850 (ArielGlenn) echo -n ânești | od -t x1 0000000 c3 a2 6e 65 c8 99 74 69 You appear to be seeing a string representation of t...
[03:54:07] Analytics, Dumps-Generation, Wikidata, Wikidata-Query-Service: Categorylinks dump might have some problem with the encoding - https://phabricator.wikimedia.org/T264850 (ArielGlenn) >>! In T264850#6531377, @Milimetric wrote: > @ArielGlenn is this something you'd know about or know who to point me...
[05:01:45] Analytics-Radar, Datasets-General-or-Unknown, Dumps-Generation, Product-Analytics, and 2 others: Set up generation of JSON dumps for Wikimedia Commons - https://phabricator.wikimedia.org/T259067 (ArielGlenn)
[06:42:13] good morning
[06:48:27] bonjour
[06:51:58] Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: Add dimensions to editors_daily dataset - https://phabricator.wikimedia.org/T256050 (JAllemandou) Thanks @cchen - Let's make the CR move for data to appear from next month :)
[06:53:33] elukey: anything you'd need from me today?
[06:54:00] joal: I am checking the under-replicated blocks and it looks strange, https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=now-24h&to=now
[06:54:42] ah wow the namenode failed over
[06:54:54] interesting
[06:55:24] elukey: I don't know if it's feasible, but it'd be great to have a message in chan telling us when the namenode or resourcemanager fails over
[06:55:28] maybe
[06:56:03] ok so from charts an-master1002 seems ok in terms of blocks
[06:56:32] it might be a jmx weirdness
[06:56:49] the main issue is that logs have rotated and I don't see anything related to why it failed over
[06:57:44] :(
[06:57:49] but yes we could have a simple icinga check that fires if, say, for a long time the master is not an-master1001
[06:58:05] elukey: I guess that log rotation means we have lost the previous set?
[06:59:05] it seems that we keep 3 rotations, and we go back up to a couple of hours AFTER the failover (I assume when the metrics change)
[06:59:20] good thing is that I don't see any weird GC metric
[07:00:20] joal: ok if I failover back to 1001?
[07:00:39] please!
[07:01:35] done, let's see if the jmx metric revoer
[07:01:37] *recover
[07:01:43] ack
[07:03:55] yep recovered
[07:04:07] perfect
[07:04:25] elukey: is node decommissioning going ok?
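As an aside on the check elukey floats above (alert if an-master1001 is not the active NameNode for too long): the NameNode's /jmx servlet already exposes both the HA state and the under-replicated block count, so a minimal poller could look like the sketch below. The hostnames and the HTTP port are assumptions, not the actual Icinga check that was later set up.

```python
#!/usr/bin/env python3
"""Minimal sketch: poll the HDFS NameNodes' JMX endpoint and report which is active."""
import requests

# Assumed hostnames and dfs.namenode.http-address port; adjust to the real values.
NAMENODES = ["an-master1001.eqiad.wmnet", "an-master1002.eqiad.wmnet"]
JMX_URL = "http://{host}:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"

def fsnamesystem(host):
    # The JMX servlet returns a JSON document whose "beans" list holds the queried bean.
    beans = requests.get(JMX_URL.format(host=host), timeout=5).json()["beans"]
    return beans[0]

for host in NAMENODES:
    bean = fsnamesystem(host)
    print(f"{host}: state={bean['tag.HAState']} "
          f"under_replicated={bean['UnderReplicatedBlocks']}")
```

Comparing `tag.HAState` across both masters would also help catch the "jmx weirdness" case above, where the exported metrics and the real state disagree.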
[07:04:35] !log failover from an-master1002 to 1001 for HDFS namenode (the namenode failed over hours ago, no logs to check)
[07:04:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:05:01] joal: it is yes, I was about to exclude the third one when I noticed the weird metric
[07:05:16] ok great - so about 1 day per node to move blocks gently, is it?
[07:05:26] correct
[07:05:30] great
[07:06:13] * joal dreams of blocks moving in a grid
[07:09:27] * elukey plays https://www.youtube.com/watch?v=S5S6s5dZXNM
[07:11:37] * joal now has music in his dream :)
[07:32:57] Analytics, Operations, SRE-Access-Requests, Patch-For-Review: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (elukey) @lexnasser you should now be able to ssh to the stat100x hosts (notebooks are not there anymore, deprecated, we copied your things...
[07:34:26] joal: ah so we have a check that at least one namenode is active
[07:58:52] !log decom analytics1044 from Hadoop
[07:58:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:54:30] elukey: heya - do you have a minute?
[08:57:13] sure
[08:58:20] I'm testing distcp and I'm hitting a wall: in order to copy data owned by the hdfs user (root for HDFS) and others, I need to run a MR job as hdfs - BUT: the hdfs user is banned on yarn
[08:59:35] yes correct, it is a setting that we have on yarn, but we can change it
[08:59:42] I mean if it is needed we can allow it
[09:00:42] elukey: For testing purposes, it'd be great :)
[09:00:56] elukey: we can allow it on the test cluster, this is for testing only
[09:03:57] joal: the test cluster does not exist now, we have to re-create it with new hw
[09:04:18] but it seems an ok use case to me, we'll need the hdfs user anyway no?
[09:04:29] FYI, there's an Icinga alert for a failed systemd service on stat1006 (jupyterhub-iflorez-singleuser.service), some failing Python import, so probably related to the Buster update
[09:06:11] we'll need it indeed elukey
[09:06:21] moritzm: thanks yes! I think that Irene needs to update her venv
[09:06:49] joal: https://gerrit.wikimedia.org/r/c/operations/puppet/+/633151
[09:07:20] (for some reason we are now allowing 'nobody', that doesn't make sense)
[09:07:29] ack
[09:07:44] I don't know why we were allowing it - weird
[09:07:48] I mean in order to launch a job as hdfs one needs to have krb auth, at that point we are already screwed
[09:08:03] I am pretty sure it was my copy/paste mistake
[09:16:31] Analytics: Check home/HDFS leftovers of rush - https://phabricator.wikimedia.org/T265121 (MoritzMuehlenhoff)
[09:16:32] joal: I am running puppet on all workers, then I'll roll restart the node managers
[09:16:47] Ack elukey - thanks a lot for that, sorry for the extra work :(
[09:17:31] elukey: I think it's good best practice not to allow hdfs to run jobs - let's try to think about removing it after the upgrade
[09:18:03] joal: yeah but with kerberos I think it is not a big deal to run yarn containers as hdfs
[09:18:14] Analytics, Dumps-Generation, Wikidata, Wikidata-Query-Service: Categorylinks dump might have some problem with the encoding - https://phabricator.wikimedia.org/T264850 (Lucas_Werkmeister_WMDE) The encoding looks correct in my terminal: `lang=shell-session $ curl -s https://dumps.wikimedia.org/ro...
[09:18:18] makes sense elukey
[09:23:52] ah ouch, the config doesn't work
[09:24:03] :S
[09:25:46] no it is only yarn being stupid
[09:25:54] as in?
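A side note on the T264850 encoding thread above: ArielGlenn's `od` output and Lucas's curl check both appear to point at the same thing, namely that the dump bytes are valid UTF-8 and any garbling shows up only when a consumer decodes them with the wrong codec. A small Python sketch of that distinction, using the exact byte sequence from the ticket:

```python
# The byte sequence from ArielGlenn's `od -t x1` output above.
data = bytes.fromhex("c3a26e65c8997469")

print(data.decode("utf-8"))    # 'ânești' -> the dump bytes are valid UTF-8
print(data.decode("latin-1"))  # 'Ã¢neÈ\x99ti' -> what genuine mojibake would look like

# Round-trip check: if re-encoding the decoded text yields the same bytes,
# the data itself is fine and the problem lives in whatever reads it.
assert data.decode("utf-8").encode("utf-8") == data
```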
[09:26:45] it failed due to address already bound
[09:27:11] so the crappy init.d scripts are not handling restarts very well all the time (like leaving the old process running)
[09:27:26] I am doing "stop" + "restart" now, 3 nodes at a time
[09:34:26] ok
[09:34:38] elukey: would you please ping me when I can resume testing?
[09:37:01] yep
[09:38:16] I am still doing the roll restart, there was some other weird issue
[09:38:25] no problem
[09:47:02] Analytics-Clusters: Improve logging for HDFS Namenodes - https://phabricator.wikimedia.org/T265126 (elukey)
[09:47:06] sigh --^
[09:47:13] joal: green light
[09:47:34] !log roll restart of hadoop-yarn-nodemanager on all hadoop workers to pick up new settings
[09:47:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:54:30] ack elukey
[09:56:10] elukey: I confirm the change has worked (hdfs can run jobs)
[09:57:07] super
[09:59:51] elukey: I also confirm distcp works as expected in terms of ownership/writes/updates when using the same cluster as origin/dst
[10:00:05] elukey: I plan on doing the test again once we have another cluster
[10:01:11] Analytics: Increase in usage of /var/lib/mysql on an-coord1001 after Sept 21st - https://phabricator.wikimedia.org/T264081 (elukey) The graph to check is https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&refresh=5m&var-server=an-coord1001&var-datasource=thanos&var-cluster=analytics...
[10:02:55] joal: super
[10:05:02] Analytics: Check home/HDFS leftovers of rush - https://phabricator.wikimedia.org/T265121 (elukey) ` ====== stat1004 ====== total 38700 drwxrwxr-x 3 root root 4096 Aug 20 2018 08_20_2018_audit drwxr-xr-x 2 4610 wikidev 4096 Oct 5 23:40 bin -rw-rw-r-- 1 root root 19602 Sep 11 2018 dhistory...
[10:39:01] Analytics-Clusters, Analytics-Kanban: Create the new Hadoop test cluster - https://phabricator.wikimedia.org/T255139 (elukey) a:elukey
[11:15:10] !log bootstrap the Analytics Hadoop test cluster
[11:15:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:22:51] wow elukey - we already have a test cluster?
[11:24:13] joal: nono I am working on it, it needs some time :D
[11:24:25] sure, but still, the hosts!
[11:25:16] my idea is to bootstrap it bare minimum, explain things to Razzi/Tobias and then work with them to complete it
[11:25:22] for example, myself from the past wrote
[11:25:23] https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Hadoop
[11:26:05] the first time I broke the hdfs fs I had to restart from scratch and it was soooo painful
[11:26:09] joal :[ help please (when you can)
[11:26:30] hey mforns - I'm here
[11:27:00] can we bc when you're finished with elukey?
[11:27:11] sure, give me a minute
[11:27:15] :]
[11:33:21] ready mforns
[11:35:08] elukey: sounds good - let me know when I can test distcp (no volume, only folders) :-P
[11:57:30] Analytics, Dumps-Generation, Wikidata, Wikidata-Query-Service: Categorylinks dump might have some problem with the encoding - https://phabricator.wikimedia.org/T264850 (marcmiquel) Thank you @ArielGlenn and @Lucas_Werkmeister_WMDE, So, to explain what I am doing ( https://pastebin.com/kPrwQ0Lb )...
[12:00:59] going to the doctor, will be afk for a bit! ttl
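For reference, the distcp test joal describes above (same cluster as origin and destination, verifying that ownership and permissions survive) boils down to an invocation along these lines. The paths and the exact preserve flags here are illustrative assumptions, not the command that was actually run; it is wrapped in Python only to keep the examples in this log in one language.

```python
"""Rough sketch of a DistCp run that preserves ownership and only copies changed files."""
import subprocess

src = "hdfs://analytics-hadoop/wmf/data/archive"        # hypothetical source path
dst = "hdfs://analytics-test-hadoop/wmf/data/archive"   # hypothetical destination path

# -pugp keeps user, group and permissions; -update copies only files whose
# size/checksum differ. It must run as a principal that can read everything,
# hence the need (discussed above) to let the hdfs user submit YARN jobs.
subprocess.run(
    ["hadoop", "distcp", "-pugp", "-update", src, dst],
    check=True,
)
```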
[12:49:42] joal: checked the asns given by pmacct against the ones given by maxmind, and:
[12:50:33] about 60% of the requests are not present in maxmind ISP database
[12:50:35] but
[12:51:01] from those, 99% belong to the same asn, which probably netflow people already know
[12:52:54] so, if we ignore this very common asn, the rest of the ip-asn pairs match 98.5% of the time. and when they do not match, it's because maxmind does not have that record (asn=-1).
[12:57:12] Hey guys, currently battling a DHCP/ISP issue
[12:57:31] In better news: https://www.comicsbeat.com/funko-announces-official-this-is-fine-dog-pop/
[13:12:59] mforns: what is the asn?
[13:13:22] 14907?
[13:14:05] klausman: ack!
[13:17:15] !log execute "cumin 'stat100[5,8]* or an-worker109[6-9]* or an-worker110[0,1]*' 'apt-get install -y linux-headers-amd64'"
[13:17:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:17:43] klausman: spoke with Moritz about our issue, the linux-headers-amd64 solution seems ok
[13:18:07] OK. Should we maybe add it as a preceding step in the rocm role?
[13:18:21] (i.e. not as part of the package list there, but as a step before it)
[13:20:15] could be an option yes, with a comment about the why
[13:23:40] elukey: it's 64600
[13:25:20] ah yes private asn, okok
[13:25:27] rocm role? 🎸🎸🎸
[13:26:09] gpu rocks :D
[13:30:17] I can make a patch once my dhcpcd stops falling over
[13:48:38] gilles: sorry I kept remembering and forgetting to respond, here is the referrer classification code:
[13:48:39] https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/source/+/81744162364493d65ad746ab500f0302c0080ac6/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#270
[13:48:57] basically, empty urls, weird protocols, malformed domains, etc.
[13:49:17] (some of that code looks a bit weird, I should add more test cases)
[13:51:33] milimetric: thanks, I get it now. that's a lot of non-URL referrers, when looking at user traffic shares
[13:51:34] oh, no, the tests are there, I assume someone looked up whether the assertUnknowns don't happen in "real life" as valid referrers, but maybe not...
[13:51:35] https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/source/+/81744162364493d65ad746ab500f0302c0080ac6/refinery-core/src/test/java/org/wikimedia/analytics/refinery/core/TestWebrequestRefererClassifier.java#47
[13:52:38] like www.google.com would be "unknown" and https://www.google.com would be something, but I'm not sure if the former is sent by any popular user agents
[13:52:56] I wonder about what's in there, because it's a lot
[13:53:41] it would be relatively easy to check, want me to process like an hour's worth of referer urls that are classified "UNKNOWN" and stick them in a temp table so we can look?
[13:53:44] oh sorry, I misremembered
[13:53:49] it's small
[13:54:06] nevermind, but it does answer my questions, thank you
[13:54:31] ok, cool, it's easy to check if something doesn't feel right
[13:58:58] elukey: what do you think, pair on hue?
[14:02:16] (PS2) Milimetric: [WIP] Refactor state for cleanliness and consistency [analytics/wikistats2] - https://gerrit.wikimedia.org/r/631791 (https://phabricator.wikimedia.org/T262725)
[14:02:51] milimetric: need 10/15 mins, is it ok?
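The one-hour sample milimetric offers at 13:53 would roughly be a query like the following, restricted to user traffic (which, as nuria notes a bit later in the log, matters because bots skew the referrer shares). This is a sketch: the wmf.webrequest field names and the lowercase 'unknown' label follow how refinery usually exposes referer_class, and the partition values and target table are hypothetical.

```python
"""Sketch: sample one hour of referers classified as unknown, user traffic only."""
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("unknown-referers-sample").getOrCreate()

sample = (
    spark.read.table("wmf.webrequest")
    # Hypothetical hour; pick whatever partition is convenient.
    .where("webrequest_source = 'text' AND year = 2020 AND month = 10 AND day = 9 AND hour = 12")
    .where("agent_type = 'user' AND referer_class = 'unknown'")
    .groupBy("referer")
    .count()
    .orderBy(F.desc("count"))
    .limit(100)
)

# Hypothetical temp table for eyeballing the results.
sample.write.mode("overwrite").saveAsTable("some_user_db.unknown_referers_sample")
```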
[14:02:58] elukey: of course, take your time
[14:08:08] Analytics, Analytics-Kanban, Operations, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (mforns) Hi @ayounsi, can you help me? I have some more questions: * What is the field that we want to extract the AS name for? I see as_src, as_d...
[14:12:31] heh, we can just leave whatever we want in post-install hooks?
[14:12:33] "Also, the author of core-js ( https://github.com/zloirock ) is looking for a good job -)"
[14:13:18] oooh, now I'm suspicious that someone's looking for that string to launch an attack :/
[14:19:10] milimetric: ok! bc?
[14:19:16] yes, omw
[14:34:40] gilles: unknown referrer will also include empty (if i remember this right)
[14:35:38] gilles: no, wait, i changed this (duh) https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/source/+/81744162364493d65ad746ab500f0302c0080ac6/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/RefererClass.java?autodive=0%2F
[14:35:53] Hi team - what have I missed?
[14:36:36] gilles: it is important to only look at user traffic (agent_type='user') otherwise the referrer data is all skewed by bots
[14:38:07] Analytics, Analytics-Kanban, Operations, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (ayounsi) > * What is the field that we want to extract the AS name for? I see as_src, as_dst, peer_as_src, peer_as_dst? Ideally all of them, but a...
[16:10:54] Analytics, Operations, SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (lexnasser) @elukey Yep, that's the correct email. I also confirm that I'm now able to access Turnilo and Stat1007. Thanks for your help!
[16:12:17] Quarry: Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158 (Bstorm)
[16:12:21] Quarry, Patch-For-Review, cloud-services-team (Kanban): Prepare Quarry for multiinstance wiki replicas - https://phabricator.wikimedia.org/T264254 (Bstorm)
[16:12:47] Quarry: Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158 (Bstorm) This would be trivial to add to the work on the parent task I just added. Therefore, adding it.
[16:19:21] nuria: yes I was looking at users only
[16:23:26] Quarry: Quarry seems to hang somehow - https://phabricator.wikimedia.org/T265155 (Wurgl)
[16:43:02] gilles: k
[16:44:31] Analytics, MediaWiki-REST-API, Platform Team Sprints Board (Sprint 5), Platform Team Workboards (Green), Story: System administrator reviews API usage by client - https://phabricator.wikimedia.org/T251812 (eprodromou) I think if we are getting the data into analytics, we're good. I'll follow...
[16:54:22] Analytics, Analytics-Kanban, Operations, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (mforns) > > What is the field that we want to extract the AS name for? I see as_src, as_dst, peer_as_src, peer_as_dst? > Ideally all of them, but...
[17:04:34] can etcd be read from the analytics network? On first look it seems like not currently, but maybe i missed something
[17:05:06] ebernhardson: not possible correct, can I ask why you'd need it?
[17:06:36] elukey: i'm parameterizing the active DC in our airflow so it correctly looks for eventgate. I have it as a constant, but it could be sourced from etcd
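Stepping back to the pmacct versus MaxMind ASN comparison from earlier in the afternoon, and to the AS-name question in T254332: the MaxMind side of such a check can be done with the geoip2 library along these lines. The database path is an assumption, and the real comparison was presumably done in bulk rather than one IP at a time; this just shows the lookup shape.

```python
"""Sketch: look up ASN / AS organization for an IP in the MaxMind ISP database."""
import geoip2.database
from geoip2.errors import AddressNotFoundError

# Assumed path; adjust to wherever the MaxMind databases are kept on the host.
reader = geoip2.database.Reader("/usr/share/GeoIP/GeoIP2-ISP.mmdb")

def lookup_asn(ip):
    """Return (asn, as_organization), or (-1, None) when MaxMind has no record."""
    try:
        rec = reader.isp(ip)
        return rec.autonomous_system_number, rec.autonomous_system_organization
    except AddressNotFoundError:
        return -1, None

print(lookup_asn("208.80.154.224"))  # illustrative IP only
```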
[17:06:43] for eventgate partitions i mean
[17:08:16] seems a good use case, even if in theory it could be easily solvable with a puppet hiera flag (switchovers happen so infrequently..)
[17:08:38] hmm, yea puppet could set it too. do we change a variable somewhere in hiera?
[17:08:42] I am saying that since the analytics vlan firewall is meant to "protect" production from the power of the hadoop/etc..
[17:08:51] at least, this was its original plan
[17:08:58] (it predates me though)
[17:09:21] yea, i was under the impression it was also put in place because we have researchers and such in analytics that shouldn't be touching prod
[17:09:32] exactly yes, and etcd is very delicate
[17:09:49] but in theory it should also be authenticated
[17:09:53] happen to know which puppet thing in hiera? I can poke around but i suspect the codfw string is everywhere :)
[17:10:42] I am not sure if we have anything in puppet currently, I'd ask serviceops to be sure.. if it is a hassle we can open ports for etcd ebernhardson, it should be fine
[17:11:05] (PS9) Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254)
[17:11:09] i mean, this happens rarely enough we can ship a patch and change it twice a year, it just seemed nice to source from a place of truth :)
[17:12:18] yepyep :)
[17:12:45] even a custom puppet flag for airflow could be enough
[17:16:28] Analytics, Analytics-Kanban, Privacy Engineering, Product-Analytics, and 3 others: Drop data from Prefupdate schema that is older than 90 days - https://phabricator.wikimedia.org/T250049 (nettrom_WMF) @Milimetric : I inspected the sanitized data by looking at the event structs of random partition...
[17:18:06] Quarry: Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158 (Bstorm) I dug around a little bit and realized it isn't as simple as what I stated above. The reason is that those are read-write databases, which dramatically changes the scope of what Quarry could...
[17:18:16] Quarry: Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158 (Bstorm)
[17:18:19] Quarry, Patch-For-Review, cloud-services-team (Kanban): Prepare Quarry for multiinstance wiki replicas - https://phabricator.wikimedia.org/T264254 (Bstorm)
[17:19:56] Quarry, cloud-services-team (Kanban): Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158 (Bstorm)
[17:20:59] Analytics, Analytics-Kanban, Operations, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (ayounsi) > Or maybe I misunderstood what needs to be done here... I assumed we want to determine whether the given IP is v4 or v6. But which IP w...
[17:22:23] Gone for tonight team - have a good weekend :)
[17:22:47] you too!
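On the active-DC question above, the "custom puppet flag for airflow" option elukey mentions could surface to the DAGs as an Airflow Variable instead of an etcd read from the analytics VLAN. A minimal sketch, with the variable name and the datacenter-prefixed topic convention as assumptions:

```python
"""Sketch: derive the datacenter-prefixed eventgate topic from an Airflow Variable."""
from airflow.models import Variable

def eventgate_topic(stream: str) -> str:
    # Defaults to eqiad; a switchover only needs the Variable (or the puppet
    # flag backing it) flipped to codfw, not a code change in every DAG.
    active_dc = Variable.get("active_datacenter", default_var="eqiad")
    return f"{active_dc}.{stream}"

# e.g. eventgate_topic("mediawiki.revision-create") -> "eqiad.mediawiki.revision-create"
```

This matches ebernhardson's "ship a patch twice a year" fallback while keeping the constant out of the DAG code itself.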
[17:22:50] (03PS10) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) [17:31:05] 10Analytics, 10Analytics-Kanban, 10Privacy Engineering, 10Product-Analytics, and 3 others: Drop data from Prefupdate schema that is older than 90 days - https://phabricator.wikimedia.org/T250049 (10Milimetric) @nettrom_WMF it is kind of time intensive, it would take me about 2 days of work. I know it seem... [17:33:06] elukey: grazie mille for the LGTM on the rocm stuff [17:35:53] prego! :) [17:41:25] * elukey afk! [17:52:57] 10Analytics, 10Analytics-Kanban, 10Privacy Engineering, 10Product-Analytics, and 3 others: Drop data from Prefupdate schema that is older than 90 days - https://phabricator.wikimedia.org/T250049 (10nettrom_WMF) @Milimetric : Not a problem, definitely understand that this would be a non-standard request! I'... [18:54:14] 10Analytics: Request a Kerberos identity for sbisson - https://phabricator.wikimedia.org/T265167 (10SBisson) [20:23:39] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Add dimensions to editors_daily dataset - https://phabricator.wikimedia.org/T256050 (10cchen) Thank you @JAllemandou !! [20:52:45] 10Analytics: Check home/HDFS leftovers of rush - https://phabricator.wikimedia.org/T265121 (10Peachey88) [21:11:48] 10Analytics-Clusters, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10wiki_willy) a:05wiki_willy→03Cmjohnson PDUs were shipped out today and should arrive next week. Assigning back to @Cmjohnson to complete... [21:56:43] (03CR) 10BryanDavis: multiinstance: Attempt to make quarry work with multiinstance replicas (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [22:39:36] 10Analytics, 10Analytics-Kanban, 10Privacy Engineering, 10Product-Analytics, and 3 others: Drop data from Prefupdate schema that is older than 90 days - https://phabricator.wikimedia.org/T250049 (10nettrom_WMF) @Milimetric : It looks like there's no data in `event_sanitized.prefupdate` for 2020-09-19 throu... 
[22:49:41] mforns: (for monday) that blip on the entropy for os_family for access_type="desktop" is a bot requesting the page 'Bible'
[23:04:24] (CR) Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: Bstorm)
[23:06:15] (PS11) Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254)
[23:12:17] mforns: is a bot that flies under the radar
[23:12:26] https://usercontent.irccloud-cdn.com/file/qtGJEsHt/Screen%20Shot%202020-10-09%20at%204.11.51%20PM.png
[23:12:52] https://usercontent.irccloud-cdn.com/file/QneWqLwT/Screen%20Shot%202020-10-09%20at%204.12.36%20PM.png
[23:16:45] mforns: i am going to not do desktop os-entropy alarms for pageviews, cause undetected moderate bot spikes will skew results
[23:18:50] joal: (for monday) this is actually a cool finding that further ratifies our work in bots
[23:19:23] joal: the removal of that traffic in april is what actually makes this one be a lot more "predictable"
[23:19:57] https://usercontent.irccloud-cdn.com/file/hiiAs3cI/Screen%20Shot%202020-10-09%20at%204.19.37%20PM.png
[23:20:29] joal: this is entropy of user-agent['os_family'] per access_method
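For context on the alarms nuria describes above: the signal is the Shannon entropy of the os_family distribution, computed separately per access_method, and a single undetected bot hammering one page from one OS concentrates the distribution and pulls the entropy down. A toy computation, with made-up counts standing in for the real pageview aggregates:

```python
"""Sketch: Shannon entropy of the os_family distribution, per access method."""
import math
from collections import Counter

def shannon_entropy(counts):
    # H = -sum(p_i * log2(p_i)) over the observed os_family shares.
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

# Hypothetical per-access-method view counts by os_family.
desktop = Counter({"Windows": 900_000, "Mac OS X": 250_000, "Linux": 60_000, "Other": 15_000})
mobile_web = Counter({"Android": 700_000, "iOS": 650_000, "Other": 20_000})

for name, counts in [("desktop", desktop), ("mobile web", mobile_web)]:
    print(f"{name}: H = {shannon_entropy(counts):.3f} bits")
```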