[01:16:15] joal: for tomorrow, the patch is ready and I only managed to find one more clue about the druid loader problem. I ran it with a bad template file path on purpose, and it says this: https://hue.wikimedia.org/oozie/list_oozie_workflow/0014968-180308085316299-oozie-oozi-W/ So it's a different error than the one we've been getting: https://hue.wikimedia.org/oozie/list_oozie_workflow/0014980-180308085316299-oozie-oozi-W/
[01:20:33] (03PS13) 10Milimetric: Compute geowiki statistics from cu_changes data [analytics/refinery] - 10https://gerrit.wikimedia.org/r/413265 (https://phabricator.wikimedia.org/T188113)
[06:59:09] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Alarm when /mnt/hdfs is mounted but showing no files/dirs in there - https://phabricator.wikimedia.org/T187073#4063876 (10elukey) ``` elukey@stat1005:~$ /usr/local/lib/nagios/plugins/check_mountpoint_readability /mnt/hdfs/ OK ``` Enabled it for stat1005...
[06:59:23] hello people!
[06:59:36] just deployed a new alarm that should only trigger in here for the moment
[07:00:09] that is if /mnt/hdfs on stat1005 gets flaky and even a simple ls is not executing correctly
[07:05:45] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Alarm when /mnt/hdfs is mounted but showing no files/dirs in there - https://phabricator.wikimedia.org/T187073#3963409 (10elukey)
[07:28:33] (03PS7) 10Fdans: Responsive Wikistats 2 UI [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/416999 (https://phabricator.wikimedia.org/T187345)
[08:12:24] (03PS8) 10Fdans: Responsive Wikistats 2 UI [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/416999 (https://phabricator.wikimedia.org/T187345)
[08:12:26] (03CR) 10Fdans: "@Mforns: thank you for the review!" (036 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/416999 (https://phabricator.wikimedia.org/T187345) (owner: 10Fdans)
[09:00:42] joal: o/
[09:01:13] ah maybe you are not working this morning, I always forget :(
[09:03:10] I needed to know from what host we grab sqoop-mediawiki-monthly-private but I found it, analytics-store.eqiad.wmnet
[09:07:48] (the motivation is to remove all the garbage from the router's firewall rules)
[09:17:01] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Refactor puppet code for the Hadoop Analytics cluster to roles/profiles - https://phabricator.wikimedia.org/T167790#4064031 (10elukey) After the last puppet patch we should be in a good position regarding role/profiles. The last bit to discuss/review/ref...
[09:36:04] * elukey afk for a bit!
[09:58:52] back!
[10:10:10] !log removed old mysql/ssh/ganglia analytics vlan firewall rules (https://phabricator.wikimedia.org/T189408#4055749)
[10:10:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:10:22] all right old rules removed, if anything breaks it's my fault :)
[10:27:34] (03CR) 10Elukey: "There seems to be a new one from today: commons.wikipedia.." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/420346 (owner: 10Fdans)
[10:28:43] fdans: --^
[10:28:44] :)
[10:31:09] cool, adding! elukey
[11:22:31] people something interesting happened: https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=1521543722301&to=1521544645269
[11:22:54] a peak of 150k req/s for pageviews, I think distributed in multiple ips
[11:23:02] (so rate-limit didn't kick in)
[11:26:20] so no bueno
[11:26:38] we returned a ton of 504s (gateway timeout)
[11:35:25] * elukey lunch!
[12:38:39] mforns: o/
[12:38:47] elukey, hey!
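The alarm described at [06:59:36]–[07:00:09] boils down to "does a plain ls on /mnt/hdfs succeed and return some entries". The actual contents of /usr/local/lib/nagios/plugins/check_mountpoint_readability are not shown in the log, so the following is only an illustrative sketch of a Nagios-style check along those lines:

```bash
#!/bin/bash
# Illustrative sketch (not the real plugin): Nagios-style exit codes depending on
# whether the mountpoint can be listed and actually shows some entries.
MOUNTPOINT="${1:-/mnt/hdfs}"

# A wedged fuse mount can make even a plain ls hang, hence the timeout.
if ! listing=$(timeout 30 ls "$MOUNTPOINT" 2>/dev/null); then
    echo "CRITICAL - ls on ${MOUNTPOINT} failed or timed out"
    exit 2
fi

if [ -z "$listing" ]; then
    echo "CRITICAL - ${MOUNTPOINT} is mounted but shows no files/dirs"
    exit 2
fi

echo "OK"
exit 0
```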
[12:38:52] plop elukey \o
[12:39:00] hey joal :]
[12:39:04] Hi mforns :)
[12:39:12] plop? xD
[12:39:19] elukey: I usually work on mornings, depending on how late I have worked at night ;)
[12:39:44] Yes mforns - plop as in apparating close by :)
[12:39:55] xDDD
[12:41:03] elukey, I finished modifications to the eventlogging_cleaner :], but now I have questions on the whitelist :[ do you have 5 mins?
[12:42:04] sure! I was about to say that since there is a task about deploying the eventlogging_cleaner in labs, I can prioritize it so we can test in there first
[12:42:24] joal: o/ - sure, I didn't remember what was the day that you join late in the evening
[12:42:25] elukey, OK batcave for 5 mins?
[12:42:26] elukey: from what I have seen, 504 from AQS, but nothing broke, right?
[12:42:44] mforns: joining
[12:43:17] joal: well those 504s were timeouts, so clients (beside the ones causing the spike in traffic) connecting in that moment experienced failures
[12:43:36] and no throttling was enforced
[12:43:47] elukey: Right, but internal systems were resilient
[12:43:56] what do you mean?
[12:44:24] cassandra nor AQS cashed
[12:44:31] s/cashed/crashed
[12:45:49] they didn't but either cassandra or restbase (didn't check yet) heavily timed out
[12:45:59] elukey: I understand
[12:46:11] elukey: for the client, same problem
[13:15:06] joal: I’m more on baby dury but I left you a note last night, same problem after the folder move and coord/wf renames
[13:15:15] *baby duty
[13:16:51] milimetric: currently investigating - It could be because no data is available
[13:17:30] I found data when I looked in the tmp folder, it didn’t delete it
[13:18:14] milimetric: when looking at the milimetric.geowiki_daily table, I found it was empty
[13:27:23] joal: hm, no, it's got stuff: hive (milimetric)> select * from geowiki_daily where month='2018-02' limit 10;
[13:27:50] also, though, if you look at https://hue.wikimedia.org/jobbrowser/jobs/job_1520532368078_36099/single_logs
[13:28:15] you find the tmp path and you can go: hdfs dfs -ls /tmp/0014978-180308085316299-oozie-oozi-W-json-private-geowiki-daily and see that that part of the druid job works ok
[13:28:35] there's a file in there with size 16296235 and a _SUCCESS file
[13:29:09] milimetric: Arf my bad - I wrote a query without quotes for the partition
[13:29:22] milimetric: The query ran (?????), but returned 0
[13:29:40] uh... maybe a read permission problem?
[13:30:17] milimetric: a formatting bizarre thing
[13:30:30] elukey, you are a genius, I think you're right that we have to allow a defaults section in the whitelist... thanks!
[13:30:38] When running with quotes (month='2018-02' instead of month=2018-02, worked fine)
[13:31:42] oh, that makes sense, '-' means "minus" so it was probably looking for month=2016 :)
[13:31:58] Mwarf !!! Didn't think about that milimetric !!!!
[13:32:07] ok - So not a data issue :(
[13:32:13] grumble grumble
[13:34:51] I tried sending some other invalid parameters last night, like bad filenames, and it would error in the workflow before the shell call
[13:35:38] is there a way to run just the shell locally somehow so we can get more info? This is kind of a blind debugging scenario right now.
[13:38:55] milimetric: easiest would be to replicate the settings - will try
[13:47:57] (03CR) 10Joal: "Comments inline. Still testing wh the druid loading fails." (0313 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/413265 (https://phabricator.wikimedia.org/T188113) (owner: 10Milimetric)
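For reference, the quoting pitfall hit at [13:29:09]–[13:31:58] can be reproduced with queries like the ones below; the table name is taken from the conversation, and the counts are only whatever happens to be in the table:

```bash
# Unquoted, Hive evaluates 2018-02 as arithmetic (2018 minus 2), so the predicate
# compares the string partition value against the number 2016 and matches nothing:
# the query "runs" fine but returns 0 rows.
hive -e "SELECT COUNT(*) FROM milimetric.geowiki_daily WHERE month = 2018-02;"

# Quoted, the string literal matches the partition value and the data shows up.
hive -e "SELECT COUNT(*) FROM milimetric.geowiki_daily WHERE month = '2018-02';"
```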
[13:54:12] milimetric: just started a manual launch of the command after having copied the script and template_file - it worked on an1067;(
[13:54:40] wtf
[13:54:56] milimetric: indexation failed, but the job started at all (as expected)
[13:54:57] this makes minimal sense
[13:55:17] this actually doesn't make ANY sense !
[13:55:33] indexation I can understand, I’m sure I have to tweak something, this was just my first shot at it
[13:58:18] milimetric: Can you retry the oozie job now?
[13:58:39] milimetric: I have changed the memory setting for oozie (I do hope it's not only that)
[14:00:44] elukey: looking at cassandra charts for the read spike - looked ok :)
[14:00:52] elukey: I think 504 came from AQS
[14:01:13] elukey: I'm gonna look for what type of requests were asked for
[14:02:21] joal: well it might have been 1) not enough workers on aqs side or 2) cassandra not answering promptly and forcing aqs to timeout
[14:02:47] elukey: 50th percentile in cassandra has not moved during the spike
[14:03:02] elukey: This is actually fairly scary :D
[14:05:24] elukey: read latency in cassandra actually goes down during the spike ... WEIRD !!!
[14:07:02] working on el beta, will check later :)
[14:07:08] np elukey
[14:26:46] joal: retrying now
[14:26:55] milimetric: currently doing it
[14:26:59] milimetric: please hold :)
[14:27:04] https://hue.wikimedia.org/oozie/list_oozie_workflow/0015686-180308085316299-oozie-oozi-W/?coordinator_job_id=0015685-180308085316299-oozie-oozi-C
[14:27:07] milimetric: --^
[14:27:09] holding :)
[14:27:13] Thanks :)
[14:27:58] milimetric: FOUND IT !!!!
[14:28:16] milimetric: still not working, but due to druid loading config issue
[14:28:33] yeah, that I figured, ok so what's going on?
[14:28:40] milimetric: original error was coming from a corrupted version of the druid_loader.py file under your folder
[14:28:59] !!! but I have latest... and I rm -r before putting oozie up
[14:29:09] milimetric: I have no clue :(
[14:29:30] corrupted where, in hdfs under /user/milimetric/oozie...?
[14:29:43] I noticed that in my latest run: https://hue.wikimedia.org/jobbrowser/jobs/job_1520532368078_38060/single_logs
[14:29:50] in stderr tab milimetric
[14:30:10] Look at the command: /usr/bin/env: ‘python\r’ not found
[14:30:14] oh the \r
[14:30:19] I didn't notice the \r last times
[14:30:39] So I re-pushed the file onto our hdfs oozie folder - done
[14:30:48] Man - That one was tricky @@!
[14:31:10] I have no clue how a \r could have arrived in that file however
[14:31:24] joal: uh... but the previous failures didn't have that: https://hue.wikimedia.org/jobbrowser/jobs/job_1520532368078_36099/single_logs
[14:31:35] so that wasn't the problem
[14:31:42] it might've been the problem with a test you were doing
[14:32:00] milimetric: Ahhh - Could be because I have run the job as hdfs?
[14:32:18] I also ran it as hdfs: sudo -u hdfs oozie job --oozie $OOZIE_URL -run -config oozie/mediawiki/geowiki/druid/coordinator.properties
[14:32:40] I did that: sudo -u hdfs oozie job --oozie $OOZIE_URL -Drefinery_directory=hdfs://analytics-hadoop$(hdfs dfs -ls -d /wmf/refinery/$(date +%Y)* | tail -n 1 | awk '{print $NF}') -Dqueue_name=production -Doozie_launcher_queue_name=production -config /home/milimetric/refinery/oozie/mediawiki/geowiki/druid/coordinator.properties -run
[14:32:41] oh!
[14:32:47] oh!
[14:32:50] ?
[14:32:52] wait...
[14:32:55] this makes no sense
[14:33:14] if I do sudo -u hdfs, it would use the hdfs user's home? like hdfs://.../user/hdfs right?
[14:33:30] so then how was it picking up the job at all? Which is only in /user/milimetric/oozie...?
[14:33:39] hm, I think it uses pwd
[14:33:51] current folder
[14:34:43] lemme launch the same command you did
[14:39:47] also milimetric - You should change the file oozie/util/send_error_email/workflow.xml under your oozie folder, so that your tests send emails only to you :)
[14:40:34] ok, it works, but the only difference is the refinery_directory property, and my .properties file sets it to current - why is that bad?
[14:40:53] milimetric: I have no clue :(
[14:41:03] oh, I thought I was overriding the email thing by setting sla_email_address, my bad, is there a property I can set?
[14:41:28] well, if ${name_node}/wmf/refinery/current is bad, that would screw up everything, wouldn't it?
[14:41:37] milimetric: easiest is actually to change the file I think
[14:43:08] ok, my bad, I thought I was doing that, sorry for the spam!
[14:54:28] (03CR) 10Joal: "Another 2 comments while reviewing template." (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/413265 (https://phabricator.wikimedia.org/T188113) (owner: 10Milimetric)
[14:59:48] 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Implement EventLogging Hive refinement - https://phabricator.wikimedia.org/T162610#4064937 (10Nuria)
[14:59:50] 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Sanitize Hive EventLogging - https://phabricator.wikimedia.org/T181064#4064936 (10Nuria) 05Open>03Resolved
[15:00:12] 10Analytics-Kanban, 10Patch-For-Review: piwik archive job should happen more frequently - https://phabricator.wikimedia.org/T188939#4064941 (10Nuria) 05Open>03Resolved
[15:00:24] joal: you said you changed some memory settings? I think that was it, I ran every version of the command I could think of, and it works fine now
[15:00:25] sudo -u hdfs oozie job --oozie $OOZIE_URL -run -config oozie/mediawiki/geowiki/druid/coordinator.properties
[15:00:44] that's my original version, and now it gets past the command not found error
[15:00:53] ping a-team
[15:01:56] ping joal, mforns
[15:02:29] milimetric: updated the memory setting in workflow.xml (see comment in CR, it's there as well)
[15:28:01] 10Analytics-Tech-community-metrics, 10Developer-Relations: Investigate how to identify our code contributors to on-wiki code (gadgets, templates, modules) across WMF sites - https://phabricator.wikimedia.org/T190164#4065066 (10Aklapper) p:05Triage>03Low
[15:37:50] joal: time to chat/brainstorm?
[15:39:51] (will write in here and then we can chat later)
[15:40:14] I checked on logstash (restbase dashboard) and the 504 *seems* to be generated by restbase since I see
[15:40:17] Error: connect EADDRNOTAVAIL 10.2.2.12:7232 - Local (10.64.32.129:0)
[15:40:24] the address is aqs.svc.eqiad.wmnet
[15:40:53] elukey: can you link logstash dashboard?
[15:42:22] sure!
[15:42:31] https://logstash.wikimedia.org/app/kibana#/dashboard/restbase
[15:43:12] 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4065212 (10Nuria) ping @bblack could we possibly get this review done next quarter?
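The `/usr/bin/env: ‘python\r’ not found` error seen above at [14:30:10] is the classic symptom of a script whose shebang line ends in a Windows-style CRLF. A common way to confirm and strip it before re-pushing the file to HDFS is sketched below; the local and HDFS paths are only examples, not the exact ones used in the conversation:

```bash
# Count lines that carry a stray carriage return (they show up as ^M / \r).
grep -c $'\r' druid_loader.py

# Strip the carriage returns in place; `dos2unix druid_loader.py` does the same.
sed -i 's/\r$//' druid_loader.py

# Re-push the fixed copy to the oozie folder on HDFS (example path only).
hdfs dfs -put -f druid_loader.py /user/milimetric/oozie/mediawiki/geowiki/druid/
```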
[15:46:20] 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4065219 (10BBlack) Yes, we've had this on the discussion list for ops Q4 goals (the elimination of the need for ipsec for caches<->kafka-brokers...
[15:47:28] I am trying to get where we define how many aqs workers we have for each node
[15:57:24] all right so the dashboard is not precise, since it is showing the count metric without the perSecond function applied
[15:57:33] because numbers didn't make sense
[15:57:44] the peak was ~2.7k rps
[15:57:54] that is still a lot :)
[16:00:26] UA was python-requests/2.18.4
[16:00:30] but different IPs
[16:02:37] elukey: and requests?
[16:02:53] elukey: if you have them otherwise i can look it up
[16:03:49] nuria_: from logstash is difficult to judge, but surely per-article reqs.. From Hive we'll get a better picture
[16:04:52] so yeah the 150k requests were the count, 2.8k rps * 54s
[16:11:14] ok so number of workers == number of available cpus
[16:11:22] in our case, 32 for each node
[16:13:51] (192 total)
[16:14:05] asking to services how requests are handled by hyperswitch
[16:23:34] 2800 rps / 192 workers are 14 rps / worker, that is not a ton
[16:24:05] if of course the time taken to return a single response is good
[16:25:19] checking the latency graphs, again they are not precise
[16:25:24] (for aqs)
[16:25:42] since they plot mean, if I set to max/count/etc.. I can see a big spike in latency even for mean
[16:25:45] fixing them now
[16:26:46] mmmmm no something is weird
[16:30:14] switched to ALL rather than 2xx latency
[16:30:48] p50 is ~120ms
[16:31:00] p75 is 267ms
[16:32:45] but!
[16:32:53] https://grafana.wikimedia.org/dashboard/db/restbase?panelId=15&fullscreen&orgId=1&from=1521544007975&to=1521544282670
[16:33:15] this is probably the sum of all the reqs, external+internal, and it peaks to 7.2k rps
[16:33:39] highestMax($cluster.external.v1_*.*.ALL.sample_rate, 20)
[16:39:53] the sum that I get is ~4.5k rps
[16:40:21] summing all the graphs during the spike (didn't do it before, just picked the first)
[16:43:49] also very interesting, speculative retries increased for cassandra https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&panelId=20&fullscreen&from=1521543961197&to=1521544389303
[16:44:13] Hi elukey - here now
[16:44:17] elukey: batcave?
[16:44:44] joal: sure
[16:58:49] elukey: they should plot median not mean
[16:59:05] elukey: so p50 not "average", amkes sense?
[16:59:07] *makes
[16:59:59] elukey: have meeting but will ping you guys after
[17:00:21] nuria_: they were correct but only related to 2xx, I switched to ALL and latency changes appeared
[17:01:05] elukey: ok, will look after meeting
[17:01:28] super
[17:18:17] tgr|away: question. is the tag refactor happening next quarter?
[17:30:30] 10Analytics, 10Analytics-EventLogging, 10Readers-Web-Backlog: Should it be possible for a schema to override DNT in exceptional circumstances? - https://phabricator.wikimedia.org/T187277#4065742 (10ovasileva)
[17:37:12] 10Analytics, 10Analytics-EventLogging, 10Patch-For-Review, 10User-Elukey: Run eventlogging purging script on beta labs to avoid disk getting full - https://phabricator.wikimedia.org/T171203#4065783 (10elukey) Deployed mariadb via proper puppet profile and reconfigured, now the mariadb instance is not manua...
[17:40:40] elukey: on batcave
[17:40:54] elukey: sorry, are you guys on batcave?
[17:41:59] nuria_: nope we finished a while ago
[17:42:03] elukey: i think this graph should say "median" instead of "mean"
[17:42:05] https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=now-2d&to=now&panelId=12&fullscreen
[17:42:46] sure sure feel free to fix the dashbaord
[17:42:52] *dashboard
[17:43:19] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Services (watching): Create mediawiki.job.refreshLinks_partitioned topic - https://phabricator.wikimedia.org/T190196#4065857 (10Pchelolo) p:05Triage>03High
[17:45:13] elukey: how do i edit it?
[17:45:36] so the TL;DR of me/joseph talking is that 404s might have played a big role, since they are not "light" reqs as it seems for AQS. Restbase returned 504s (gateway timeouts) after failing with "Error: connect EADDRNOTAVAIL 10.2.2.12:7232 - Local (10.64.32.129:0)" (the IP is the AQS LVS IP), so possibly aqs slowed down at the point that it caused restbase timeouts
[17:46:11] elukey: ah puppet, will do one sec
[17:46:22] nuria_: the url needs to be rewritten as "grafana-admin" etc.. and then edit on the graph
[17:46:31] elukey: oohhh
[17:47:27] nuria_: I can do it if you want, you should have access to grafana-admin but not sure now that I think about it
[17:47:37] (I am 90% sure about yes)
[17:48:28] elukey: just did it
[17:48:34] elukey: changed to median
[17:48:42] elukey: label and computation
[17:50:15] elukey: so outliers are not factored into the latency computation, it will give more realistic spikes
[17:51:06] elukey: so what were the requests returning 404? real requests or garbage ones?
[17:51:39] nuria_: did you save the dashboard?
[17:52:04] elukey: ahem...
[17:52:07] i guess not
[17:52:12] trying again
[17:52:15] sorry forgot to tell you :(
[17:53:02] elukey: now
[17:54:08] yep!
[17:54:27] elukey: and , as you said, latencies are calculated just in the 200 responses right?
[17:54:29] so about the 404s: joseph is going to do an analysis using webrequest data, so we'll be able to know more
[17:54:43] nuria_: not now, I switched to ALL
[17:55:03] elukey: for latency?
[17:55:08] yep
[17:55:19] we can also have latencies broken down by return code probably
[17:55:25] elukey: i think it should only be 2000
[17:55:27] 200
[17:55:46] elukey: otherwise 500 situations will lower our latencies as errors are fast to return
[17:55:56] elukey: right?
[17:56:14] elukey: normally latencies are calculated with http >=200
[17:56:26] elukey: otherwise you are factoring in 404 responses
[17:57:35] nuria_: in this case it seems that 50x responses were carrying a higher latency, that is a bit weird
[17:58:04] elukey: yes, that is weird but our main latency metric should not include that
[17:58:16] but I got your point, just switched to 2xx
[17:58:18] elukey: that actually seems to indicate a timeout
[17:58:27] elukey: rather than an error
[17:58:50] yes
[17:59:00] 10Analytics, 10Analytics-EventLogging, 10Readers-Web-Backlog: Should it be possible for a schema to override DNT in exceptional circumstances? - https://phabricator.wikimedia.org/T187277#4065964 (10ovasileva) 05Open>03Resolved a:03ovasileva Created {T190188} for the changes proposed above. Resolving t...
[17:59:29] a 504 one
[18:01:04] elukey: mmm.. seems query timing out in cassandra
[18:01:11] elukey: that should be easy to verify
[18:01:50] nuria_: those as far as I can see are restbase metrics, so the 504s are the ones seen by it
[18:02:10] elukey: ok, let me look at urls of incoming requests
[18:02:11] I think that AQS just slowed down and piled up requests
[18:03:39] Hi all! Any thoughts on this last comment here? https://phabricator.wikimedia.org/T185933#4056789
[18:04:35] 10Analytics-Kanban: Add defaults section to WhitelistSanitization.scala - https://phabricator.wikimedia.org/T190202#4066032 (10mforns)
[18:04:43] in the meantime, added 4xx latencies as well to the graphs
[18:05:17] (03PS1) 10Mforns: Add defaults section to WhitelistSanitization [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/420795 (https://phabricator.wikimedia.org/T190202)
[18:05:53] AndyRussG: is donate wiki not backed up by varnish?
[18:07:08] I just validated my theory (and I feel dumb), the restbase.* namespace is what we are using, and there is also an aqs.* one
[18:07:13] * elukey cries in a corner
[18:07:24] wait ... me no compredou
[18:07:27] elukey:
[18:08:30] nuria_: it is, it's varnish cached
[18:08:34] nuria_: so if you take a look at the metrics that we are plotting in aqs-elukey they start with restbase.*.etc..
[18:08:46] AndyRussG: then the pageviews are already on webrequest table
[18:08:47] so php-side EL won't work
[18:08:48] so those are the restbase ones, that are good but not emitted by aqs
[18:10:26] nuria_: yeah, that's where we've always gotten the data from. This new strategy was suggested to replace the current one that relies on Kafkatee and heavy filtering, since with EventLogging we can get a separate Kafka topic. This is as suggested by ottomata (IIRC), just double-checking that I understood correctly and that it's the right approach
[18:10:33] AndyRussG: ah i see, you are trying to add referrer url info, ok
[18:10:42] AndyRussG: ya, makes sense
[18:10:52] nuria_: um hmmm not sure if we need referrer
[18:11:22] AndyRussG: referrer in the sense of "source" of your campaign
[18:11:57] Hmmm yes... I mean, currently all the data is on the URL
[18:12:21] and it's piped and filtered and sampled via Kafkatee and slurped into a database by a python script on the FR cluster
[18:12:31] this is the pipeline that broke a few months ago
[18:12:40] and the suggestion was to move to EventLogging
[18:12:48] to get a separate Kafka topic
[18:13:44] my concern in the comment on the Phab task is that, with the new suggested method, we'd in the end be performing an extra request on the client to send to our infrastructure data that was already fully available on the cluster
[18:14:40] So I was wondering if there was somehow some other way to trigger the EL event server-side, other than via PHP (since, as you noted, the pages are Varnish cached)
[18:15:14] AndyRussG: I think schema to send EL events and general idea looks fine, i see no issue with it.
[18:15:26] nuria_: thanks!
[18:16:01] so, it's correct that the best way to get the event is to send it from the client--no other server-side options available, I guess?
[18:22:59] nuria_: take a look (if you have time) to the dashboard now
[18:23:32] elukey: looking
[18:24:22] elukey: i see, and now the 429s will show up...
[18:25:22] sorry now is better, changed the latency as well
[18:25:29] elukey: ah i see, they are on the rate limit graph , just not disclosed
[18:25:42] mmm.. not sure if disclosed is the word
[18:25:43] so those are not the rate limit ones, but 504s
[18:25:47] emitted by restbase
[18:25:55] and aqs did not return any 5xx
[18:26:34] elukey: looks like your theory is validated
[18:32:03] updated all the graphs to use the aqs ones
[18:32:11] now the thing that I don't get is latency metrics
[18:33:22] elukey: now i get it ( i know, 3 days late)
[18:33:24] because they don't show any increase afaics
[18:33:35] elukey: we were not showing just aqs metrics but also others
[18:33:58] elukey: looking again
[18:34:15] nuria_: nope we were showing latency metrics for our apis from the restbase pov
[18:34:27] elukey: pov?
[18:34:51] point of view sorry :)
[18:35:01] elukey: ah
[18:35:24] elukey: ah, not counting cassandra lookups?
[18:37:49] joal: what is the theory?
[18:38:08] nuria_: as far as I know all the metrics starting with restbase.* are showing the latency for HTTP calls from restbase to aqs.svc.eqiad.wmnet, meanwhile the ones starting with aqs.* are showing AQS's latencies
[18:38:23] elukey: ok
[18:38:34] nuria_: requests were piling up in AQS because of cassandra longer-than-usual time to answer, and restbase timed-out
[18:38:46] joal: i see a bunch of requests like
[18:38:50] https://www.irccloud.com/pastebin/kzDUnRKF/
[18:39:07] see timerange , it is a year
[18:39:10] cc joal
[18:39:37] yup
[18:39:54] plus a fair amount of 404 nuria, that are not handled by cache
[18:44:00] elukey: Looking by user-agent, by minute: Our beloved python-requests
[18:44:27] did 192125 200 requests and 188177 404 requests at minute 11:09 this morning
[18:54:10] nuria_: you mean change tags in MediaWiki? it's a Contributors project I believe so Roan would know best
[18:54:22] RoanKattouw: hola
[18:54:37] RoanKattouw: any word about the tag refactor in mw?
[18:56:14] joal: take a look now to https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=1521543814143&to=1521544407591
[18:56:44] that looks strange right?
[18:57:12] but I keep wondering about Error: connect EADDRNOTAVAIL 10.2.2.12:7232 - Local (10.64.32.129:0)
[18:57:28] and the fact that restbase times out after two minutes
[18:57:35] elukey: What do you find bizarre?
[18:57:44] the latency
[18:58:15] I am wondering if those 504s were the result of LVS not being available
[18:58:28] or overwhelmed for a brief moment
[18:58:38] or maybe applying some limits
[18:59:30] elukey: i see
[18:59:38] elukey: interesting
[18:59:45] elukey: the latency is just not reported
[19:00:12] elukey: seems more than anything aqs kaput
[19:00:38] elukey: as there are no 200s?
[19:02:12] nuria_: the only thing that doesn't validate my LVS theory is that we have seen a lot of reqs landing on aqs
[19:03:22] (03PS14) 10Milimetric: Compute geowiki statistics from cu_changes data [analytics/refinery] - 10https://gerrit.wikimedia.org/r/413265 (https://phabricator.wikimedia.org/T188113)
[19:03:31] elukey: I'm trying to get more precise data than the plots
[19:06:06] elukey: so learnings: a bunch of 404 (early timespans) lead to some 504s (unexpected) and 429 (expected) cc joal sounds right?
[19:08:36] nuria_: no 429 registered, only 504s from restbase
[19:08:55] elukey: got it
[19:10:28] elukey: it "seems" 404s should be cached for a short time
[19:11:20] that would make sense, but not super sure :)
[19:11:48] nuria_,joal going offline now but will restart tomorrow :)
[19:11:54] have a good evening/afternoon!
[19:11:59] thanks for the brainbounce
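The per-minute breakdown by user-agent and status quoted at [18:44:00]–[18:44:27] presumably came from something along the lines of the query below; the exact statement is not in the log, so the webrequest_source value, the uri_path filter and the partition values (the hour of the spike) are assumptions:

```bash
hive -e "
  SELECT SUBSTR(dt, 1, 16) AS minute,
         user_agent,
         http_status,
         COUNT(*)          AS requests
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2018 AND month = 3 AND day = 20 AND hour = 11
    AND uri_path LIKE '/api/rest_v1/metrics/pageviews/%'
  GROUP BY SUBSTR(dt, 1, 16), user_agent, http_status
  ORDER BY requests DESC
  LIMIT 50;
"
```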
[19:12:00] 10Analytics-Kanban: Analyze surge of traffiq in aqs that lead to 504s - https://phabricator.wikimedia.org/T190213#4066315 (10Nuria)
[19:12:00] Bye elukey - Preparing investigation
[19:12:03] elukey: ciao
[19:14:25] joal: do you know for certain 404s are not cached?
[19:15:03] nuria_: nope - But that'd be my guess
[19:18:38] 10Analytics-Kanban: Analyze surge of traffiq in aqs that lead to 504s - https://phabricator.wikimedia.org/T190213#4066348 (10Nuria) Thus far we can see an increase of 404s that lead to 504s
[19:19:09] 10Analytics-Kanban: Analyze surge of traffiq in aqs that lead to 504s - https://phabricator.wikimedia.org/T190213#4066349 (10Nuria) Please see: https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=1521543814143&to=1521544407591
[19:23:48] joal: 188177 404s does not seem possible with our throttling limits if they come from same ips
[19:23:59] nuria_: different ips of course
[19:24:04] same user_agent
[19:27:11] 10Analytics-Kanban, 10Services (doing): Analyze surge of traffiq in aqs that lead to 504s - https://phabricator.wikimedia.org/T190213#4066360 (10Pchelolo) In order to cache 404s in Varnish for a short period of time we'd need to add some config setting to hyperswitch and use that setting in AQS. RESTBase will...
[19:30:09] joal: asked Pchelolo and indeed 404s are not cached or cacheable at this time
[19:31:00] 11:54:36 RoanKattouw: any word about the tag refactor in mw?
[19:31:10] RoanKattouw: yessir
[19:31:11] nuria_: I wrote an RFC for it which is in last call now
[19:31:23] RoanKattouw: to be implemented next quarter?
[19:31:25] It'll pass tomorrow if nobody speaks up (or has spoken up, I haven't checked it in a few days)
[19:31:39] Hopefully; our team was reassigned to work on maps so I don't officially have time for it
[19:31:45] I wrote most of the code already though
[19:31:57] Let me find you the RFC so you can see what the proposal is
[19:32:10] https://phabricator.wikimedia.org/T185355
[19:37:00] (03CR) 10Milimetric: Compute geowiki statistics from cu_changes data (0314 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/413265 (https://phabricator.wikimedia.org/T188113) (owner: 10Milimetric)
[20:05:43] (03CR) 10Joal: "One little left - rest is good for me once tested :)" (033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/413265 (https://phabricator.wikimedia.org/T188113) (owner: 10Milimetric)
[20:21:48] 10Quarry: User interface or query "jammed" on Quarry.... - https://phabricator.wikimedia.org/T190218#4066522 (10ShakespeareFan00)
[20:34:47] nuria_: funny charts about aqs
[20:34:53] nuria_: wanna see, or tomorrow?
[20:43:56] No answer - Gone to bed :)
[20:47:56] 10Analytics-Kanban: [EL] Correct Print schema mysql whitelist - https://phabricator.wikimedia.org/T190223#4066652 (10mforns)
[20:53:28] joal: good!
[21:22:07] 10Quarry: User interface or query "jammed" on Quarry.... - https://phabricator.wikimedia.org/T190218#4066522 (10Bstorm) This is caused by a disk full on the proxy. Working on it.
[21:22:30] 10Quarry, 10cloud-services-team: User interface or query "jammed" on Quarry.... - https://phabricator.wikimedia.org/T190218#4066795 (10Bstorm) p:05Triage>03High a:03Bstorm
[21:30:25] 10Quarry, 10cloud-services-team: User interface or query "jammed" on Quarry.... - https://phabricator.wikimedia.org/T190218#4066821 (10Bstorm) Verified the queries in Quarry work again after clearing logs. Access logs are filling due to a some OSM tiles traffic.
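A quick way to double-check the "404s are not cached or cacheable" conclusion from [19:27:11]–[19:30:09] is to look at the response headers of a per-article request that is bound to 404; the article title and the year-long date range below are made up for illustration:

```bash
# Inspect the status line and caching-related headers of a guaranteed 404
# from the public pageviews per-article endpoint.
curl -sI "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Some_title_that_does_not_exist/daily/2017032000/2018032000" \
  | grep -iE '^(HTTP|cache-control|x-cache)'
```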
[21:54:02] 10Analytics-Kanban, 10RESTBase-API, 10Services (doing): Analyze surge of traffic in AQS that lead to 504s - https://phabricator.wikimedia.org/T190213#4066903 (10mobrovac)
[22:20:04] 10Quarry, 10cloud-services-team: Quarry server erros caused by Cloud VPS shared proxy failures - https://phabricator.wikimedia.org/T190218#4066966 (10bd808)
[22:20:14] 10Quarry, 10cloud-services-team: Quarry server errors caused by Cloud VPS shared proxy failures - https://phabricator.wikimedia.org/T190218#4066522 (10bd808)
[22:53:20] PROBLEM - YARN NodeManager JVM Heap usage on analytics1055 is CRITICAL: CRITICAL - analytics_hadoop_yarn_nodemanager is 3963686016 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen
[23:08:04] (03PS6) 10Nuria: Create and manipulate date objects according to UTC timezone [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/417476 (https://phabricator.wikimedia.org/T189266) (owner: 10Fdans)
[23:57:34] (03CR) 10Nuria: [C: 04-1] "Still bugs on time selector but fixed issue reported by Marcel" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/417476 (https://phabricator.wikimedia.org/T189266) (owner: 10Fdans)