[02:23:53] Analytics, WMDE-Analytics-Engineering: Sqoop problem on stat1004 - https://phabricator.wikimedia.org/T265851 (GoranSMilovanovic)
[02:24:10] Analytics, WMDE-Analytics-Engineering: Sqoop problem on stat1004 - https://phabricator.wikimedia.org/T265851 (GoranSMilovanovic) p:Triage→High
[02:25:17] Analytics, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Sqoop problem on stat1004 - https://phabricator.wikimedia.org/T265851 (GoranSMilovanovic)
[02:42:55] nuria: heads-up that I won't be at standup tomorrow, but will be there Tuesday
[03:10:39] lexnasser: k
[06:25:37] good morning!
[06:34:03] Analytics, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Sqoop problem on stat1004 - https://phabricator.wikimedia.org/T265851 (elukey) @GoranSMilovanovic the mysql jdbc driver is not present on Debian Buster, but `org.mariadb.jdbc.Driver` is. Can you try to add something like `-D org.mariadb.j...
[06:44:19] (CR) Elukey: [V: +2 C: +2] Upgrade to upstream version 1.27.0 [analytics/turnilo/deploy] - https://gerrit.wikimedia.org/r/634500 (https://phabricator.wikimedia.org/T233336) (owner: Elukey)
[06:47:12] !log turnilo upgraded to 1.27.0
[06:47:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:58:55] Analytics, Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (elukey) I checked `tcpdump -i lo port 9091` on an-tool1007 (where turnilo runs) and I see the following request headers landing to turnilo: ` X-Client-IP: 10.20.0.52 X-Forwarded-For: 93.34.XX.XX,...
[07:09:01] Good morning
[07:15:22] bonjour
[07:15:37] How are you elukey?
[07:16:46] good, and you?
[07:17:02] all good :)
[07:18:28] Analytics, Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (elukey) @Milimetric let's sync about what to do, I can try to mangle X-Client-IP from httpd (didn't find a quick way yet) or we could let turnilo also read `X-Forwarded-For` (IIUC it reads only X-...
[07:27:52] !log decom analytics1055 from the hadoop cluster
[07:27:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:30:20] joal: by the end of this week we should have the backup nodes all ready
[07:30:35] (the 16 to decom I mean)
[07:30:41] \o/
[07:30:48] elukey: how many do we have currently
[07:30:49] ?
[07:31:08] 12 I think
[07:31:29] elukey: if they are already added to the backup cluster, can I start copying?
[07:31:51] joal: we need to bootstrap the cluster first, I'll let Razzi and Tobias do it
[07:31:54] elukey: if not for real, at least some real data, to check speed and reliability?
[07:32:03] Ah ok, sorry :)
[07:46:25] there is a bit of work to do, namely
[07:46:27] reimaging
[07:46:33] re-init all partitions
[07:46:39] configuration for the cluster
[07:46:42] puppet etc..
[09:10:39] Analytics, Operations, Traffic, netops: Turnilo: per-second rates for wmf_netflow bytes + packets - https://phabricator.wikimedia.org/T263290 (elukey) @CDanis I added the following config to turnilo on an-tool1005 (staging instance): ` measures: - name: bytes title: Bytes...
[09:20:17] elukey: Heya - I'm interested to have your opinion on my message about AQS
[09:24:14] joal: seems what marcel pointed out
[09:24:45] from a quick glance about numbers, I am still a bit on the fence, we didn't see any impact on aqs1005 for example
[09:24:57] anyway, it seems the most plausible explanation
[09:25:42] (will re-check all the graphs later on, thanks a lot for collecting the info)
[09:25:47] elukey: I assume it's chance (or bad-chance) whether we get very high latency once every now and then
[09:26:12] elukey: happy to talk more when you wish :)
[09:27:13] the memory part is what I doubt a bit, see the memory panels in https://grafana.wikimedia.org/d/XhFPDdMGz/cluster-overview?orgId=1&var-site=eqiad&var-cluster=aqs&var-instance=All&var-datasource=thanos&from=now-7d&to=now
[09:27:31] indeed we have less ram on some nodes, but I don't see variations of cached memory when under pressure for example
[09:27:41] Analytics, Operations, Traffic, netops: Turnilo: per-second rates for wmf_netflow bytes + packets - https://phabricator.wikimedia.org/T263290 (faidon) Yay, that's awesome! You can't imagine how much time this would save! I changed the config a little bit. Specifically: * Bits per second is more...
[09:27:51] I mean, there are but not horrible ones
[09:28:41] mmmm maybe I need to check those more
[09:29:44] I was thinking about disk perfs but no, the IOPs are the same
[09:29:53] elukey: from that dash, I looked at the `Disk throughput per host` panel
[09:30:42] In there the write lines are similar between hosts, but the reads seem oriented toward older hosts
[09:31:00] also check disk usage %
[09:31:22] there is a difference in there, more notable, but again I wouldn't expect timeouts to happen
[09:31:52] elukey: I assume disk usage is due to smaller RAM: less caching
[09:32:39] could be yes
[09:35:37] but we are around 10/15% of usage, not even close to something that could show pressure on those SSDs
[09:35:51] yeah, true
[09:36:17] joal: coffee?
[09:36:22] sure elukey
[09:36:25] joining the cave
[09:40:00] (CR) Ayounsi: [C: +1] "Only checked the first comment with the fields names and descriptions, and they LGTM." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[10:06:45] G'day
[10:07:09] I love fall, for the weather is nice and changing, and it's not too hot. I don't like the attendant migraines :-S
[10:08:44] good morning!
[10:09:19] !log add pps/bps measures to wmf_netflow in turnilo
[10:09:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:09:35] Hi klausman
[10:10:55] joal: very interesting https://gerrit.wikimedia.org/r/c/operations/puppet/+/634931
[10:11:40] Nice!
[10:11:48] Analytics, Analytics-Kanban, Operations, Traffic, netops: Turnilo: per-second rates for wmf_netflow bytes + packets - https://phabricator.wikimedia.org/T263290 (elukey) a:elukey
[10:12:04] all credits to Faidon for the final config
[10:12:11] but it seems to be working really well
[10:27:26] Analytics, Analytics-Kanban, Security: Review request for data export - https://phabricator.wikimedia.org/T264255 (MoritzMuehlenhoff) >>! In T264255#6547266, @Ottomata wrote: > @MoritzMuehlenhoff, would it be ok to temporarily re-add @Groceryheist's access while he copies out some data? There's no c...
[10:31:33] That's neat.
[10:35:54] going afk for lunch break, see you in a couple of hours :)
[10:36:51] Analytics, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Sqoop problem on stat1004 - https://phabricator.wikimedia.org/T265851 (GoranSMilovanovic) @elukey Thank you, it worked. However, another problem emerged: it seems that the `analytics-privatedata` user (the one we impersonate for Kerberos...
[10:54:23] Analytics, Patch-For-Review: Add urlshortener button to Turnilo - https://phabricator.wikimedia.org/T233336 (Milimetric) Turnilo is just running on top of Express, and passes the trustProxy setting on to Express, I believe that’s what we need. I’m updating my puppet change to include setting it to alway...
[12:00:54] (CR) Joal: "A bunch of things here." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[12:09:15] lol I said good morning team in the ops channel instead of here
[12:48:53] fdans: I knew that you had a SRE inside your soul :D
[13:18:34] (CR) Ottomata: Add Refine transform function for Netflow data set (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[13:18:46] (CR) Ottomata: "(Dunno why my comment didn't save)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[13:21:27] (CR) Ayounsi: [C: +1] Add Refine transform function for Netflow data set (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[13:28:19] Analytics, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Sqoop problem on stat1004 - https://phabricator.wikimedia.org/T265851 (elukey) @GoranSMilovanovic ah yes the warning/exception is due to the fact that beeline wants to record your command into the analytics-privatedata's history file, but...
[13:43:30] hi elukey
[13:44:28] HELLO DAN
[13:44:34] :D good morning
[13:47:26] mornin' did you see my comments? I'm eager to put this to bed
[13:47:53] and again I'm sorry I didn't go into detail on Friday, I hope it didn't waste too much of your time
[13:52:04] ah yes I was about to try it
[13:54:32] milimetric: just applied manually, still see "Blocked users can't make short URLs."
[13:55:48] milimetric: I don't get the change, 'X-Forwarded-For': context.clientIp, should put localhost in the X-Forwarded-For sent to meta no?
[13:58:19] so, trustProxy should make Turnilo take the XFF header that you pasted in the task, split by comma, and take the first IP as the request's IP address
[13:58:48] my Turnilo PR considers whether trustProxy is on and uses the correct request.ip field which should be populated that way
[13:59:02] (and passes it as context.clientIp)
[13:59:11] so something's not working right
[14:09:27] ah!
[14:09:28] does that make sense elukey, or am I missing something that would still only show Turnilo localhost in *its* XFF headers?
[14:09:48] nono in theory it should work
[14:13:51] to find out, we'd have to log what Turnilo thinks is in req.ip and req.connection.remoteAddress here:
[14:13:51] https://github.com/allegro/turnilo/pull/657/commits/edf71714153aeeff484563a7216fdfae9e197d0a#diff-b0b09fbb7870d25e329e6e17e825e9be831eb1b8d78439db85ddf6c09be3a466R33
[14:14:06] to see if they're different with/without trustProxy on
[14:14:13] that should tell us pretty clearly what's not working
[14:14:52] I think if I just add a console.log there we don't need to restart or anything, should be really safe even in prod
[14:14:56] joal, mforns - ok for tomorrow's meeting with Jarek?
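(Editor's sketch, not part of the log.) The behaviour milimetric describes at 13:58 (with trustProxy on, take the X-Forwarded-For header, split it by comma, and use the first IP as the request's address) can be illustrated roughly as below. This is not Turnilo's or Express's code; the function name `client_ip_from_xff` and the example addresses are hypothetical, and the fallback to the socket peer address when trustProxy is off is an assumption modelled on the req.ip vs req.connection.remoteAddress comparison discussed above.

```python
from typing import Optional


def client_ip_from_xff(xff_header: Optional[str],
                       remote_addr: str,
                       trust_proxy: bool) -> str:
    """Pick the address a server would treat as the client's IP.

    Hypothetical helper: when the proxy is trusted and an
    X-Forwarded-For header is present, return its left-most entry;
    otherwise fall back to the socket peer address.
    """
    if trust_proxy and xff_header:
        # Split "client, proxy1, proxy2" and drop empty fragments.
        entries = [part.strip() for part in xff_header.split(",") if part.strip()]
        if entries:
            return entries[0]
    return remote_addr


if __name__ == "__main__":
    header = "203.0.113.7, 10.20.0.52"  # made-up documentation addresses
    print(client_ip_from_xff(header, "127.0.0.1", trust_proxy=True))
    print(client_ip_from_xff(header, "127.0.0.1", trust_proxy=False))
```

If the two calls return the same address on a real deployment, that would suggest the trustProxy setting is not reaching the layer that parses the header, which is what the proposed console.log test is meant to reveal.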
[14:15:08] I'd do it in testing but I'm not 100% clear whether or not the ssh tunnel messes with the XFF
[14:15:29] do it in testing first just to see if anything explodes, and then ok for the quick prod test
[14:17:16] hey team!
[14:17:20] elukey: yes tomorrow good
[14:17:28] super :)
[14:17:32] elukey: for me at least
[14:35:55] I found the JDBC string to use TLS between hive and mysql
[14:36:25] achievement of the week, I tested so many configs
[14:40:02] !log restarted eventlogging-processor with filter to skip events already migrated to event platform - T262304
[14:40:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:40:06] T262304: eventlogging-processor should fail to produce schemas that have been migrated to Event Platform - https://phabricator.wikimedia.org/T262304
[14:59:57] elukey: tomorrow's meeting with Jarek is good also for me :)
[15:00:57] ping ottomata
[15:09:40] Analytics-Radar, Product-Analytics, Anti-Harassment (The Letter Song): Capture special mute events in Prefupdate table [4 hour spike] - https://phabricator.wikimedia.org/T261461 (Mholloway)
[15:26:41] Analytics-Clusters, Operations: Rename an-scheduler1001 to an-coord1002 - https://phabricator.wikimedia.org/T265620 (Cmjohnson)
[15:28:28] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, Patch-For-Review: Disable eventlogging-valid-mixed topic - https://phabricator.wikimedia.org/T265651 (Ottomata) p:Triage→Medium
[15:31:18] Analytics-Radar, Growth-Team, Product-Analytics, Product-Infrastructure-Data, MW-1.36-notes (1.36.0-wmf.12; 2020-10-05; NEVER DEPLOYED): PrefUpdate captures user preference modifications at registration - https://phabricator.wikimedia.org/T260867 (sdkim) Open→Resolved a:sdkim
[15:38:42] Analytics: Request a Kerberos identity for sbisson - https://phabricator.wikimedia.org/T265167 (Nuria) @elukey to create kerberos credentials
[15:48:33] Analytics, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Sqoop problem on stat1004 - https://phabricator.wikimedia.org/T265851 (Milimetric) p:High→Low a:elukey (ping self/@JAllemandou to update sqoop docs)
[15:49:51] Analytics-Clusters, Analytics-Kanban: Upgrade AMD ROCm drivers/tools to latest upstream - https://phabricator.wikimedia.org/T264408 (Milimetric) p:Triage→High
[15:55:10] Analytics, Analytics-Kanban: Update Wikidata usage metric - https://phabricator.wikimedia.org/T264945 (Milimetric) p:Triage→High a:mforns metrics available so far on: https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/structured-data/ To correct, @Nuria suggests we g...
[16:02:47] ping fdan
[16:02:49] ping fdans
[16:14:53] Analytics, Anti-Harassment, CheckUser, Privacy Engineering, and 2 others: SPIKE: consider all problems that might happen when we handle Google's privacy changes - https://phabricator.wikimedia.org/T265057 (Milimetric) p:Triage→High
[16:15:53] Analytics, Analytics-Kanban, Anti-Harassment, CheckUser, and 3 others: SPIKE: consider all problems that might happen when we handle Google's privacy changes - https://phabricator.wikimedia.org/T265057 (Milimetric) a:Milimetric
[16:28:17] milimetric: FYI you were talking about doing some analysis of per schema events? maybe with that cli tool
[16:28:23] you can consume error specific kafka topics with kafkacat
[16:28:27] and group by schema
[16:28:28] i'm sure
[16:30:30] that's a cool idea ottomata :)
[16:52:55] yeah, exactly ottomata, that's what I was thinking
[16:53:07] https://phabricator.wikimedia.org/T265765
[17:34:09] Analytics: Quick data exploration CLI - https://phabricator.wikimedia.org/T265765 (Ottomata) > how to consume from our kafka brokers > could take shorthand or intuitive names so we don't have to look stuff up EventStreamConfig can help you here: ` curl -s 'https://meta.wikimedia.org/w/api.php?action=stre...
[18:01:00] Analytics, Analytics-EventLogging, JavaScript, Wikimedia-production-error: OperationError: The operation failed for an operation-specific reason in generateRandomSessionId - https://phabricator.wikimedia.org/T263041 (Jdlrobson) Invalid→Open I've seen this error for Firefox 81 https://log...
[18:15:59] razzi: do you want to quickly meet for the kerberos task?
[18:16:51] elukey: Yeah, let's batcave
[18:17:11] 2 mins and I'll join
[18:18:39] cool
[18:20:04] ok I am in
[18:32:20] Gone for today team - see you tomorrow
[18:47:01] me too!
[19:11:27] Analytics, Product-Infrastructure-Data, Wikimedia-Logstash, observability: Create a separate logstash ElasticSearch index for schemaed events - https://phabricator.wikimedia.org/T265938 (Ottomata)
[19:11:50] Analytics, Product-Infrastructure-Data, Wikimedia-Logstash, observability: Create a separate logstash ElasticSearch index for schemaed events - https://phabricator.wikimedia.org/T265938 (Ottomata)
[19:12:40] Analytics, Product-Infrastructure-Data, Wikimedia-Logstash, observability: Create a separate logstash ElasticSearch index for schemaed events - https://phabricator.wikimedia.org/T265938 (Ottomata)
[19:20:22] (PS3) Jenniferwang: Add SpecialInvestigate schema to EventLogging whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/628237 (https://phabricator.wikimedia.org/T262496)
[19:20:59] Analytics-Radar, Event-Platform, Instrument-ClientError: Bot throwing large amount of errors - https://phabricator.wikimedia.org/T264453 (Krinkle)
[19:28:10] Analytics-Radar, Event-Platform, Instrument-ClientError: Bot throwing large amount of errors - https://phabricator.wikimedia.org/T264453 (Krinkle) Why does it matter that it came from a specific IP address or user ID? If the error is common enough to hit our monitoring threshold, and investigation sh...
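(Editor's sketch, not part of the log.) The T264453 thread mentions running "a cheap hash (short, not one-to-one mappable, e.g. fnv32)" over the wgUserId value and the GeoIP cookie value. A minimal illustration follows, assuming the 32-bit FNV-1a variant (the thread does not say which FNV variant is meant). The helper `bucket_token`, the `|` separator and the sample inputs are hypothetical and not taken from any MediaWiki or EventGate code.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV32_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Compute the 32-bit FNV-1a hash of a byte string."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte                            # FNV-1a: XOR first...
        h = (h * FNV32_PRIME) & 0xFFFFFFFF   # ...then multiply, kept to 32 bits
    return h


def bucket_token(user_id: int, geo_cookie: str) -> str:
    """Hypothetical helper: derive a short grouping token from two values.

    The output is only 32 bits wide, so distinct inputs can collide;
    it groups events from the same source without being a reversible ID.
    """
    combined = f"{user_id}|{geo_cookie}".encode("utf-8")
    return format(fnv1a_32(combined), "08x")


if __name__ == "__main__":
    # Made-up sample values, purely for illustration.
    print(bucket_token(12345, "US:CA:San_Francisco"))
```

The same inputs always yield the same token, which is what lets repeated errors from one source be counted together while the raw identifier stays out of the event.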
[19:31:57] Analytics, Analytics-Kanban, Event-Platform, Privacy Engineering, and 4 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (Krinkle) `http.client_ip` was not intentionally added to this schema. It was inherited from an unrelated default...
[19:37:48] Analytics-Radar, Event-Platform, Instrument-ClientError: Bot throwing large amount of errors - https://phabricator.wikimedia.org/T264453 (Jdlrobson) > Why does it matter that it came from a specific IP address or user ID? I thought T265131 makes this pretty clear - we have a lot of errors that stem...
[19:40:38] Analytics, Analytics-Kanban, Event-Platform, Privacy Engineering, and 4 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (Jdlrobson) I understand and I'm not saying keep IPs. I'm just pointing out this will create challenges for us wi...
[19:42:41] Analytics-Radar, Event-Platform, Instrument-ClientError: Bot throwing large amount of errors - https://phabricator.wikimedia.org/T264453 (Jdlrobson) > E.g. by running a cheap hash (short, not one-to-one mappable, e.g. fnv32) over the wgUserId value, the GeoIP cookie value. Which will be reasonably st...
[20:02:07] Analytics, Patch-For-Review, User-Elukey: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439 (razzi) After discussing with @elukey, we can leave stats.wikimedia.org running on port 8443, since it's not an address end users will see.
[20:09:57] nice email razzi! :)
[20:10:03] Thanks :)
[20:14:33] Analytics, Analytics-Kanban, Event-Platform, Privacy Engineering, and 4 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (JFishback_WMF) @Jdlrobson and @Krinkle - what about hashing the IPs? A hashed IP would still tell you how many I...
[20:24:48] (CR) Ottomata: [C: +2] Add camus-wmf-0.1.0-wmf12.jar with EventStreamConfig support [analytics/refinery] - https://gerrit.wikimedia.org/r/633800 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[20:24:50] (CR) Ottomata: [V: +2 C: +2] Add camus-wmf-0.1.0-wmf12.jar with EventStreamConfig support [analytics/refinery] - https://gerrit.wikimedia.org/r/633800 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[20:47:53] PROBLEM - Throughput of EventLogging EventError events on alert1001 is CRITICAL: 129.1 ge 30 https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=13&fullscreen&orgId=1
[21:13:43] Analytics, Analytics-Kanban, Event-Platform, Privacy Engineering, and 4 others: Remove http.client_ip from EventGate default schema (again) - https://phabricator.wikimedia.org/T262626 (Nuria) >A hashed IP would still tell you how many IPs are involved, without revealing any individual IP For it t...
[21:15:18] RECOVERY - Throughput of EventLogging EventError events on alert1001 is OK: (C)30 ge (W)20 ge 0.9845 https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=13&fullscreen&orgId=1
[21:35:01] (PS1) Razzi: Remove postal code, latitude, and longitude from geodata [analytics/refinery/source] - https://gerrit.wikimedia.org/r/635085 (https://phabricator.wikimedia.org/T236740)
[21:53:16] Analytics-Radar, Product-Analytics, Anti-Harassment (The Letter Song): Capture special mute events in Prefupdate table [4 hour spike] - https://phabricator.wikimedia.org/T261461 (Niharika) Open→Resolved Thanks for the explanation @Mholloway. @jwang I agree that we can keep the data retained f...
[22:03:57] Analytics: Check home/HDFS leftovers of nathante - https://phabricator.wikimedia.org/T256356 (Ottomata)
[22:04:03] Analytics, Analytics-Kanban, Security: Review request for data export - https://phabricator.wikimedia.org/T264255 (Ottomata) Open→Resolved I temporarily copied the nathante_wmf_export.tar.gz file that @Groceryheist prepared before he left to analytics.wikmiedia.org/published/datasets/one-off...
[22:17:05] Analytics: Retain nonsensitive mediawiki_api_request logging data - https://phabricator.wikimedia.org/T265952 (Maryana)
[22:58:31] Analytics, Analytics-Wikistats: pagecounts-ez uploads stopped after 9/24 - https://phabricator.wikimedia.org/T265378 (S1magreene)