[01:39:28] (03CR) 10Milimetric: [C: 04-1] "Tests are great, code looks good, just one thing to remove and I think we're set." (034 comments) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/657228 (https://phabricator.wikimedia.org/T207171) (owner: 10Lex Nasser) [07:18:28] goood morning [07:23:50] Hello [07:27:40] 10Analytics: Increase segment replication factor on Druid Public to 3 - https://phabricator.wikimedia.org/T272670 (10elukey) [07:28:24] hehe :) --^ [07:29:06] milimetric: I am confused about the cassandra oozie SLAs, I noticed the "Killed" and I thought that too many loading jobs caused Cassandra to reject some (so no alerts or outages, that was related to druid reboots) [07:29:27] but I might have misunderstood so all good :) [07:29:43] I see that all cassandra coords are happy now so nice! [07:29:54] joal: bonjour :) [07:30:18] elukey: o/ [07:30:36] elukey: I indeed think there have been some cross-actions on cassandra leading to misunderstanding [07:31:14] IMO cassandra complained yesterday because of druid restarts, and cassandra jobs alerts were all ok due to jobs restarts [07:31:35] elukey: thinking of druid-public [07:31:48] We have 5 hosts of 2.75Tb storage each [07:32:39] We currently store 4 mediawiki_history datasources, ~500Gb (less) each, with replication factor 2 [07:33:21] So, we store 4Tb shuffled around 5 hosts, ~800Gb per host [07:33:39] joal: yes the AQS alerts were all druid related, I meant the oozie ones, I thought that they failed after a big restart (all at once). I didn't notice they were monthly, when I restarted the other day more than one daily cassandra timeout out, so this is why I pinged Dan. [07:33:41] This number is coherent with what the service panel in druid-UI tells me [07:33:56] *timed out [07:34:18] No problem at all elukey - More precision, better understanding :) [07:34:24] ? [07:34:34] anyway [07:34:53] I am merging the hdfs cleaner change ok? [07:35:12] elukey: I meant that more detailed explanation leads to better understanding of systems, and also communication :) [07:35:40] elukey: We need the dataframe2druid correction to happen first [07:35:46] elukey: Let me submit a patch [07:38:32] ... [07:39:00] you can send it later on no problems, I have other tasks to check in the meantime [07:39:23] ack elukey - I'll also submit a task to keep only 3 datasources in druid-public [07:39:52] elukey: Then, bumping replication factor is doable by UI IIRC [07:43:36] 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10elukey) @razzi we'd need to write down a plan about how to proceed, since this in theory is going to be done during February. We have some constraints: 1) We need to... [07:48:40] elukey: I could go with a little help in understanding puppet - let me know when you have a minute to batcave [07:52:51] sure, is it ok in 30 mins? [07:54:45] no problem [07:54:59] 10Analytics, 10Analytics-Kanban: Alter table for navigation timing errors out in Hadoop test - https://phabricator.wikimedia.org/T268733 (10elukey) This needs to be revisited when testing Bigtop 1.5, since it runs Hive 2.3.6, even if it shouldn't be fixed in there too. If so we'll just turn on the global optio... [08:42:13] ok joal I am free now [08:42:29] how can I help? [08:44:47] elukey: batcave for a second?
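
The druid-public storage estimate above works out as follows (a quick sketch in Python; the host count, datasource sizes, and replication factors are the ones quoted in the conversation, nothing else is assumed):

    # Rough per-host segment load on druid-public: 5 hosts with 2.75 TB of
    # segment cache each, 4 mediawiki_history datasources of ~500 GB.
    hosts = 5
    capacity_per_host_tb = 2.75

    def per_host_tb(n_datasources, size_tb, replication):
        return n_datasources * size_tb * replication / hosts

    print(per_host_tb(4, 0.5, 2))  # ~0.8 TB/host, the ~800 GB figure above
    print(per_host_tb(4, 0.5, 3))  # ~1.2 TB/host if RF is simply bumped to 3
    print(per_host_tb(3, 0.5, 3))  # ~0.9 TB/host if only 3 datasources are kept
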
[08:44:51] it'll be faster [08:45:18] sure [09:37:02] (03PS1) 10Joal: Fix DataFrameToHive repartition to empt failure [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657775 (https://phabricator.wikimedia.org/T272177) [09:37:30] (03PS2) 10Joal: Fix DataFrameToHive repartition-to-empty failure [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657775 (https://phabricator.wikimedia.org/T272177) [09:47:21] elukey: do you wish me to bump the default replication-factor on druid-public? [09:48:28] joal: +1 [09:48:52] !log Raise druid-public default replication-factor from 2 to 3 [09:48:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:13:21] interesting, in our worker puppet config we assume hive sqoop oozie etc.. [10:13:29] so I'll need to make those optionals [10:13:55] indeed elukey [10:14:33] joal: ok if I reboot archiva ? [10:14:42] sure elukey, no prob for me [11:31:19] ah joal can we check https://gerrit.wikimedia.org/r/c/analytics/wmf-product/jobs/+/651794 today to give the green light to Neil? [11:31:49] Yes elukey - will do [11:37:43] is there any reason why we have the analytics keytab on an-master? [11:38:15] I don't see timers for that [11:38:22] plus nothing runs as analytics [11:38:36] probably not elukey :) [11:38:59] ok so I'll remove that need :) [11:52:29] <- Lunch [11:54:31] PROBLEM - Hadoop DataNode on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [11:54:31] PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [11:55:37] RECOVERY - Hadoop DataNode on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [11:56:30] ahhh snap downtimeeee [11:56:51] PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [11:56:51] PROBLEM - Hadoop DataNode on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [11:58:19] downtimed, these are all backup ones [12:02:00] (03CR) 10Joal: [C: 03+1] "I'm no bash expert but I think I understand most of the thing and it looks good to me. Minimal comments inline, no need to change." (032 comments) [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/651794 (https://phabricator.wikimedia.org/T261953) (owner: 10Neil P. Quinn-WMF) [12:12:51] RECOVERY - Hadoop DataNode on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [12:14:31] https://dropbox.tech/application/why-we-chose-apache-superset-as-our-data-exploration-platform [12:29:22] elukey: ooh, interesting! 
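
For reference, the replication-factor bump logged above can also be done outside the coordinator UI by posting a cluster-wide default retention rule to the coordinator rules API. A minimal sketch only; the coordinator URL and tier name below are assumptions, and the UI change is equivalent:

    # Sketch: load every interval forever with 3 replicas in the default tier.
    import requests

    COORDINATOR = "http://druid-coordinator.example.org:8081"  # hypothetical host

    rules = [
        {"type": "loadForever", "tieredReplicants": {"_default_tier": 3}},
    ]

    resp = requests.post(
        f"{COORDINATOR}/druid/coordinator/v1/rules/_default",
        json=rules,
    )
    resp.raise_for_status()
    print("default rule set updated, HTTP", resp.status_code)
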
[12:45:00] https://github.com/apache/superset/pull/8219 is very nice [13:30:02] RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:30:04] RECOVERY - Hadoop NodeManager on an-worker1136 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:33:00] joal: we have the backup cluster :) [13:33:07] \o/ [13:33:18] * joal dances with elukey [13:35:03] I am adding metrics to see if all is good, but so far it seems so [13:54:24] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-hadoop_cluster=analytics-backup-hadoop&var-worker=All [13:54:27] \o/ [13:54:48] available space ~560TB now [13:54:53] but we are missing two nodes [13:54:59] that will add +96T [13:55:14] plus we may add datanodes also to the master/standby [13:55:19] to get other +96 [13:59:12] PROBLEM - Age of most recent Hadoop NameNode backup files on an-worker1124 is CRITICAL: CRITICAL: 0/1 -- /srv/backup/hadoop/namenode: No files https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:59:14] PROBLEM - At least one Hadoop HDFS NameNode is active on an-worker1118 is CRITICAL: Hadoop Active NameNode CRITICAL: no namenodes are active https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:59:23] exactly, downtime expired [13:59:28] this is the backup cluster sorry [14:01:51] elukey@an-worker1118:~$ /usr/local/bin/check_hdfs_active_namenode [14:01:51] Hadoop Active NameNode OKAY: an-worker1118-eqiad-wmnet [14:02:04] something weird with icinga for sure [14:03:24] ah missing setting [14:08:22] RECOVERY - Age of most recent Hadoop NameNode backup files on an-worker1124 is OK: OK: 1/1 -- /srv/backup/hadoop/namenode: 0hrs https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:08:22] RECOVERY - At least one Hadoop HDFS NameNode is active on an-worker1118 is OK: Hadoop Active NameNode OKAY: an-worker1118-eqiad-wmnet https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:08:54] goood [14:10:06] hola hola team [14:10:41] hola mforns [14:11:21] (03CR) 10Ottomata: [C: 03+1] Fix DataFrameToHive repartition-to-empty failure [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657775 (https://phabricator.wikimedia.org/T272177) (owner: 10Joal) [14:12:17] reviewing this also ^ [14:19:47] (03CR) 10Mforns: Fix DataFrameToHive repartition-to-empty failure (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657775 (https://phabricator.wikimedia.org/T272177) (owner: 10Joal) [14:27:12] Hi mforns :) [14:27:19] heya joal [14:27:31] I will change the test, if you agree it's a typo [14:27:43] I can take it from here [14:28:49] mforns: I'll correct the thing and will let you deploy/rerun if ok [14:28:56] how you prefer [14:31:38] (03CR) 10Joal: "Thanks for the reviews - patch following" (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657775 (https://phabricator.wikimedia.org/T272177) (owner: 10Joal) [14:38:07] (caught up with confusion now, re: aqs/cassandra alerts yesterday. I think looking longer term we should try to get more clarity out of AirFlow than we have out of Oozie.) 
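
The backup cluster capacity figures mentioned above add up as follows (simple arithmetic using only the numbers quoted in the conversation):

    # Raw HDFS space on the backup Hadoop cluster.
    current_tb = 560          # 14 workers already in the cluster
    two_more_workers_tb = 96  # the two nodes not yet added
    master_standby_tb = 96    # optional datanodes on the master/standby hosts

    print(current_tb + two_more_workers_tb)                      # 656 TB
    print(current_tb + two_more_workers_tb + master_standby_tb)  # 752 TB
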
[14:38:59] (03PS3) 10Joal: Fix DataFrameToHive repartition-to-empty failure [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657775 (https://phabricator.wikimedia.org/T272177) [14:40:30] joal, elukey: is it OK to deploy today to fix the refine alarms? [14:40:37] +1 for me [14:40:45] sure mforns [14:40:54] 👍 [15:00:45] Gone for kids, back at in 2h [15:05:16] 10Analytics, 10Event-Platform, 10Services: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (10Ottomata) [15:05:23] 10Analytics, 10Event-Platform, 10Services: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (10Ottomata) [15:05:25] 10Analytics, 10Event-Platform: eventgate-wikimedia should emit metrics about validation errors - https://phabricator.wikimedia.org/T257237 (10Ottomata) [15:05:53] 10Analytics, 10Event-Platform: eventgate-wikimedia should emit metrics about validation errors - https://phabricator.wikimedia.org/T257237 (10Ottomata) Related: {T268027} [15:32:50] (03CR) 10Mforns: [C: 03+2] "LGTM!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657775 (https://phabricator.wikimedia.org/T272177) (owner: 10Joal) [15:33:09] (03CR) 10Mforns: [V: 03+2 C: 03+2] Fix DataFrameToHive repartition-to-empty failure [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657775 (https://phabricator.wikimedia.org/T272177) (owner: 10Joal) [15:47:33] * elukey taking a break [15:49:25] 10Analytics-Clusters, 10Patch-For-Review: Create a temporary hadoop backup cluster - https://phabricator.wikimedia.org/T260411 (10elukey) The cluster is up and running, together with metrics etc.. The current set up is: * two master nodes (an-worker1118 and an-worker1124) * 14 worker nodes, for a grant total... [15:53:48] 10Analytics, 10Event-Platform: produce_canary_events job should not fail if a schema is missing examples - https://phabricator.wikimedia.org/T270138 (10Ottomata) Hm, just looked at doing this, and it isn't really that easy to change. It does kinda suck that a schema without any examples will cause all canary... 
[16:08:44] (03CR) 10Ottomata: [C: 03+1] Add rdf-streaming-updater schemas for side outputs [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/647723 (https://phabricator.wikimedia.org/T269619) (owner: 10DCausse) [16:09:58] 10Analytics, 10Analytics-Kanban, 10Discovery, 10Product-Analytics, 10Research: New anaconda-wmf release with updated packages - https://phabricator.wikimedia.org/T271960 (10Ottomata) a:03Ottomata [16:15:29] (03PS1) 10Mforns: Update changelog.md for v0.0.145 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657865 [16:15:52] (03CR) 10Mforns: [V: 03+2 C: 03+2] Update changelog.md for v0.0.145 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/657865 (owner: 10Mforns) [16:16:51] Starting build #68 for job analytics-refinery-maven-release-docker [16:28:31] Project analytics-refinery-maven-release-docker build #68: 09SUCCESS in 11 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/68/ [16:30:59] * razzi out for an errand, back in 30 [16:41:20] Starting build #35 for job analytics-refinery-update-jars-docker [16:42:24] Project analytics-refinery-update-jars-docker build #35: 09SUCCESS in 1 min 4 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/35/ [16:42:26] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.0.145 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/657872 [16:43:47] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Merging for deployment" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/657872 (owner: 10Maven-release-user) [16:44:39] !log Deployed refinery-source v0.0.145 using jenkins [16:44:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:51:21] a-team: do we know what happened to the AQS cluster in deployment-prep? It seems not there anymore [16:51:30] ? [16:51:34] not there? [16:51:46] the vms are not running [16:52:01] no idea [16:52:06] better, I don't see any vm called deployment-aqs.. [16:53:24] I think that people may have asked us if we needed those and then dropped [16:53:35] oh [16:53:37] I was chatting with Lex about ssh access, and then I realized nothing was running :D [16:53:44] we didn't claim them? [16:53:55] I mean, it's not a question [16:54:12] we didn't claim them probably... [16:54:26] I am trying to see if this is in an email [16:55:04] https://phabricator.wikimedia.org/T257118#6632388 [16:55:32] Amir1: o/ [16:55:37] elukey: https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2020_Purge [16:55:51] aqs is there, as if it should be fine [16:56:28] I think that Amir had to shut them down in November because deployment-prep was full [16:56:38] I am wondering if we didn't answer to some ping or something similar [16:57:34] lexnasser: I think that we'll need to re-create the cluster from scratch, maybe under analytics.. Are you super blocked? [16:58:07] elukey: no, not super blocked. we can handle this next week [16:58:09] elukey: there's a thread on Jul 4th 2020 called: [Ops] Please clean your beta cluster instances [16:58:39] is there any other way to access a beta cassandra instance for AQS testing? [16:58:47] nope we don't have any [16:59:29] no worries! this can totally be handled next week - thanks so much for your help! [17:01:45] lexnasser: can you open a task asking to re-create the cluster? So we'll handle that next week [17:02:11] sure! 
[17:02:58] elukey: that sounds vaguely ffamiliar (removing deployment-aqs stuff), but i'm not totally sure [17:05:13] razzi, elukey or ottomata: please can you cr and merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/657875 ? part of today's fix-deploy [17:05:13] 10Analytics: Re-create deployment-aqs cluster - https://phabricator.wikimedia.org/T272722 (10lexnasser) [17:06:29] mforns: Andrew was too kind, I'd have asked minimum 5 euros given it is friday [17:06:32] :D [17:06:42] mforns done [17:06:44] heheh, thanks ottomata :] [17:06:50] :) [17:09:44] !log bumped up refinery-source jar version to 0.0.145 in puppet for Refine and DruidLoad jobs [17:09:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:11:46] !log starting refinery deploy using scap [17:11:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:19:10] ottomata: question about eventgate, i thought since mediawiki_revision_recommendation_create events were merged to schemas-events-primary on dec 1 i would be able to emit an event, but eventgate says "Failed loading schema at /mediawiki/revision/recommendation-create/1.0.0". I suspect i've missed some critical step? [17:20:04] PROBLEM - Check the last execution of refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:20:39] mostly i was just trying to concvince the systems to create the related tables in the analytics network so i can deploy the stuff that will read from that event once they start doing them for real [17:20:54] the alert is about 'refinery-job-0.0.145.jar does not exist' [17:21:01] should be fixed with the deploy right mforns ? [17:21:23] elukey: which alert? [17:21:38] I deployed 145 before asking you guys to merge the bump up! [17:21:57] oh, I see the alert [17:23:00] mforns: I see up to 144 on launcher [17:23:01] PROBLEM - Check the last execution of refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:23:21] ?? [17:23:35] ah ok now 145 is there [17:23:37] ok, rechecking [17:23:39] oh! [17:23:46] I think we needed to wait the deploy to finish [17:23:51] before merging the puppet change [17:24:03] elukey: but the deploy was finished no? [17:24:27] mforns: I saw the finish a minute ago in #operations [17:24:34] oh... [17:24:45] started at :14 and ended at :28 [17:24:49] err :23 [17:24:52] elukey: but that was refinery, not refinery source [17:25:06] mforns: yep exactly, but it is what the timers are using [17:25:16] ahhhhh, ofc [17:25:26] ok [17:25:28] TIL [17:25:35] :) [17:25:42] do you want to restart the failed jobs? [17:25:49] elukey: yes [17:26:30] ok lemme know if you need help, but it should be a simple systemctl restart in theory [17:26:54] ok [17:27:25] ebernhardson: which eventgate instance are you usingg? 
[17:28:24] !log restarted refine_event and refine_eventlogging_legacy in an-launcher1002 [17:28:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:29:35] ottomata: whichever one testwiki talks to when i grab eventgate service [17:29:50] sec lemme find the code [17:30:21] RECOVERY - Check the last execution of refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:30:42] ottomata: i ran this on testwiki from shell.php: https://phabricator.wikimedia.org/T262226#6664888 [17:31:45] ah ok right eventbus [17:31:47] uses eventgate-main [17:31:52] which has locally baked schemas in the image [17:31:57] to reduce runtime coupling [17:32:00] https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#eventgate-main [17:32:39] ok, so file a ticket for a new build? [17:32:45] we need to bump the repo and buildl rebuild the image [17:32:55] ebernhardson: i'll just use existing ticket to do [17:32:57] not hard [17:33:01] alright, thanks! [17:33:29] RECOVERY - Check the last execution of refine_event on an-launcher1002 is OK: OK: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:35:30] FYI ebernhardson https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/657880 [17:37:21] ebernhardson: do you mind if i wait until monday to deploy it? [17:37:27] just cause: friday [17:37:29] :) [17:38:00] ottomata: certainly! i'm switching a job from weekly to hourly at the same time, not deployign that on friday either :) [17:38:12] k coo [17:38:23] ebernhardson: also, feel free to remind me... :) [17:38:28] lol, ok [17:38:45] !log finished refinery deploy to HDFS [17:38:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:44:40] going afk for the weekend folks! [17:44:41] o/ [17:44:54] bye elukey - have a good weekend [17:45:01] you too! [17:45:59] byeeeeee! 
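
For context on the schema error discussed above: once the schema is available to the instance, producing an event boils down to a POST to EventGate. A minimal sketch, assuming a generic service URL and an illustrative stream name; only the $schema URI is taken from the conversation, and the exact required fields and status codes should be checked against the instance's docs:

    import datetime
    import requests

    EVENTGATE_URL = "https://eventgate.example.org/v1/events"  # hypothetical

    event = {
        "$schema": "/mediawiki/revision/recommendation-create/1.0.0",
        "meta": {
            "stream": "mediawiki.revision-recommendation-create",  # assumed name
            "dt": datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
        # ...schema-specific fields go here...
    }

    resp = requests.post(EVENTGATE_URL, json=[event])
    print(resp.status_code, resp.text)  # expect 201 when all events are accepted
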
[18:38:26] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per [18:38:26] t}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:38:38] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:39:14] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimed [18:39:14] ews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:40:20] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:40:58] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:45:28] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimed [18:45:28] ews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:45:54] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:47:26] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:50:28] https://grafana.wikimedia.org/d/000000526/aqs?orgId=1 [18:50:45] there seems to be an increase in latency and traffic [18:52:36] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per [18:52:36] t}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:53:06] razzi: did you see --^ [18:53:20] no elukey, thanks for the ping :) [18:53:46] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:54:44] so there seems to be more traffic for AQS, but I wouldn't expect this to happen [18:54:55] https://grafana.wikimedia.org/d/000000417/cassandra-system?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=aqs&var-server=All&var-disk=sda&var-disk=sdb&var-disk=sdc&var-disk=sdd&var-disk=sde [18:55:12] cassandra on aqs nodes is a bit suffering, latency wise, it might be related to the queries [18:57:13] so we don't have really an access log for AQS, sadly [18:57:26] from tcpdump I see a lot of IPs from microsoft [18:57:54] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimed [18:57:54] ews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:59:20] user-agent: xuga@microsoft.com [18:59:24] is the UA [19:00:03] ok queries have ranges like 20100101/20210101 [19:00:12] ah yes that's huge [19:00:40] yep ok it is a bot hitting [19:00:57] probably not even malicious, but issuing queries with a large time window [19:01:13] razzi: to debug this I jumped on aqs1006 and ran 'sudo tcpdump tcp port 7232 -A ' [19:01:23] and then I have filtered for user-agent, etc.. [19:01:33] it is not really ideal but we don't have requests logged on disk [19:01:43] for historical reasosn [19:02:20] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wi [19:02:20] /Services/Monitoring/aqs [19:02:30] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [19:02:54] razzi: so we can do two things - 1) follow up with them sending an email to what indicated in the UA [19:03:02] 2) block the traffic via Varnish [19:03:22] I think these people were the ones hitting us with python-requests the other day [19:03:36] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [19:03:40] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [19:03:50] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [19:04:43] elukey: and they changed the UA? 
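
The ad-hoc capture described above (tcpdump on port 7232, then filtering for user-agent) can be scripted roughly like this. A sketch only, since AQS keeps no access log on disk and the HTTP parsing here is deliberately crude:

    import collections
    import re
    import subprocess

    # Capture plaintext HTTP on the AQS service port and tally UAs and paths.
    proc = subprocess.Popen(
        ["sudo", "tcpdump", "-l", "-A", "tcp", "port", "7232"],
        stdout=subprocess.PIPE, text=True, errors="replace",
    )

    agents = collections.Counter()
    paths = collections.Counter()
    try:
        for line in proc.stdout:
            if ua := re.search(r"user-agent:\s*(.+)", line, re.I):
                agents[ua.group(1).strip()] += 1
            elif req := re.match(r"GET (\S+) HTTP", line):
                paths[req.group(1)] += 1
    except KeyboardInterrupt:
        proc.terminate()

    print("top user agents:", agents.most_common(5))
    print("top paths:", paths.most_common(5))
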
[19:08:38] razzi: I think so, but it is what the varnish block says, namely to comply to the Wikimedia API guidelines (like to add a contact to the UA etc..) [19:09:17] the block that I added was: "If the UA starts with python-requests, return a HTTP 403 that informs the user about the policy for UAs etc.." [19:09:28] so I just sent an email to the address indicated [19:09:41] asking to reduce the time window range [19:10:01] but I am not very hopeful [19:12:04] heya back from meeting [19:13:11] hola mforns [19:13:32] elukey: shouldn't thottling handle this? [19:13:37] *throttling [19:14:17] maybe the requests are spaced correctly, but too heavy in combination [19:14:22] mforns: I think that Varnish offers a very basic throttling per caching node, we never really had a good throttling [19:14:29] I see [19:14:31] in this case the traffic increased 4x [19:14:37] but the time window of the requests is huge [19:14:45] 20100101/20210101 [19:14:46] in #requests or in #bytes? [19:14:51] requests [19:14:56] https://grafana.wikimedia.org/d/000000526/aqs?orgId=1 [19:14:58] I see... wow [19:15:02] yes, ok [19:15:06] so cassandra is suffering, we should allow those queries [19:15:11] I think we discussed this in the past [19:15:13] yes ofc [19:15:40] so the UA says xuga@microsoft.com, that I contacted, but I am not really hopeful [19:15:57] yes [19:16:14] I was telling to Razzi that I think this is the python-requests UA bot that hit us previously during the week [19:16:25] but they've read our guidelines and changed the UA [19:16:33] so probably it is legit traffic [19:16:35] only too broad [19:16:40] understand [19:17:19] the other bad thing is that we don't have an access log on file [19:17:28] so I am doing tcpdump to indentify the queries [19:17:57] ah they stopped! [19:18:02] elukey: it seems by the intervals of CRITICAL->OK, that they are testing stuff [19:18:10] ah! cool [19:18:21] mforns: more than testing, see https://grafana.wikimedia.org/d/000000526/aqs?orgId=1 [19:18:26] seems more sustained traffic [19:18:52] we definitely need to set up better limits on what data and how much, perhaps by doing some estimation when we get the date range in AQS [19:18:55] I'll make a task [19:19:09] yes [19:19:41] we have historical AQS queries- we can analyze [19:19:56] we also need an access log on file [19:20:21] And setting up better limits is definitely needed - There is a task, it's been one of nuria's ask for a long time [19:20:32] yep I don't see anymore the UA [19:20:47] they didn't answer but I am sure it was legit traffic [19:20:49] elukey: maybe your email worked? [19:20:53] 10Analytics: Make sure pageview API limits are well documented - https://phabricator.wikimedia.org/T261681 (10Milimetric) p:05Medium→03High This should be re-prioritized to high, and perhaps the parent task should be reopened and revisited because we're seeing service disruption caused by single UAs doing ei... [19:21:24] the task was closed, and this was a subtask ^ we can look Monday and discuss [19:21:33] joal: hope so, I asked to reduce the time window of the queries since we were a little under pressure [19:21:50] ah yes they answered! \o/ [19:22:32] awqesome [19:22:50] they said that they are going to halve the time window, is it still too much? 
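
A sketch of the "estimation when we get the date range" idea mentioned above: compute how many days a request would touch and reject ranges beyond a limit. The per-granularity limits below are arbitrary examples, not agreed-upon values:

    from datetime import datetime

    MAX_DAYS = {"daily": 366, "monthly": 5 * 366}  # hypothetical limits

    def days_in_range(start: str, end: str) -> int:
        """start/end as YYYYMMDD, the format used in the AQS URLs above."""
        fmt = "%Y%m%d"
        return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days + 1

    def check(granularity: str, start: str, end: str) -> None:
        days = days_in_range(start, end)
        if days > MAX_DAYS.get(granularity, 366):
            raise ValueError(f"{days}-day range exceeds the {granularity} limit; "
                             "please split the request into smaller windows")

    check("daily", "20200101", "20210101")      # fine, ~1 year
    try:
        check("daily", "20100101", "20210101")  # the ~11-year range seen above
    except ValueError as err:
        print("rejected:", err)
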
[19:23:16] it was 2010 -> 2021 [19:23:21] so I guess 5y now [19:23:30] I am going to add internal@ to my reply [19:23:33] so people can follow up in case [19:23:34] elukey: we have 5y of data loaded [19:24:02] so even if 10 years are requested, we only serve 5 - and this was too much given the query rate [19:24:16] perfect, I'll ask 1y max? [19:24:47] hm - we should ask for less QPS [19:25:09] that as well yes [19:25:34] They were at 300 QPS - I suspect the reason for which we didn't throttle is because the hits are distributed on multiple IPs, but that's too much [19:26:07] interestink! [19:26:49] Let's ask for 100QPS max - I think that with that level of QPS (distributed) we can stand the 5y data service [19:28:38] elukey: does that make sense ? --^ [19:30:30] joal: I also asked to reduce the time window if possible, but it is good that they answer so we can tune the rate of queries [19:30:38] I just sent the reply with internal@ copied [19:31:15] ack elukey - thanks a million [19:31:35] <3 [19:33:12] 10Analytics, 10Event-Platform: eventgate-wikimedia should emit metrics about validation errors - https://phabricator.wikimedia.org/T257237 (10Ottomata) Hm, we might be able to do this incrementally. service-runner will let us configure multiple metrics clients. We can keep the existing statsd -> prometheues-... [19:33:14] 10Analytics, 10Product-Analytics: Superset presto error: Failed to list directory: hdfs://analytics-hadoop/wmf/data/event/... - https://phabricator.wikimedia.org/T272741 (10kzimmerman) [19:33:33] Leaving to have dinner - Will come back and have a look later [19:34:51] a-team: did you all get the email? [19:34:59] yessir [19:35:12] ack, I am going afk and will check later :) [19:35:19] thank you elukey :] [19:36:31] yep, thanks [19:38:57] 10Analytics, 10Event-Platform: eventgate-wikimedia should emit metrics about validation errors - https://phabricator.wikimedia.org/T257237 (10Ottomata) Or, I could probably just get this working with the existent service-runner statsd + prometheus-statsd-exporter config, and just define a mapping from statsd -... [19:58:50] 10Analytics-Radar, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Data: mw.user.generateRandomSessionId should return a UUID - https://phabricator.wikimedia.org/T266813 (10Mholloway) a:05Mholloway→03None [20:20:19] hmm, they are going at it again, it's going up a lot [20:20:25] https://grafana.wikimedia.org/d/000000526/aqs?orgId=1&refresh=1m [20:20:58] lots of 404 now [20:23:34] They did reduce the window to 2 years, which is good [20:24:23] Latency is higher but not too high as of yet [20:25:09] aha, p99 is still flat [20:52:28] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Patch-For-Review: eventgate-wikimedia should emit metrics about validation errors - https://phabricator.wikimedia.org/T257237 (10Ottomata) a:03Ottomata [21:06:49] hm - QPS is still high as well - I assume they've split their requests by bands of 2 years, and ranges 2010-2012 and 2012-2014 return 404 as we have data from 2015 onward [21:08:56] ottomata: I'm still enjoying thinking about kafka for storage (rather than just a stream / queue) ;) I guess if it were storage, dump generation becomes pretty easy?
(thinking wikidata.org current revisions) which would just all be json in a stream [21:08:58] The current levels seem sustainable - Will log out - see you folks - have a good weekend [22:13:06] addshore: yeah, i think that would be very useful for the WDQS updater that dcausse and search team is working on [22:13:39] right now they consume revision-create [22:13:45] but still have to grab the content from the api [22:14:21] Yeah, they grab the RDF, so slightly different to the JSON case, but having multiple streams would also be fine, and we also have to make rdf dumps [23:50:03] the rdf comes from the MW API?
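
A rough illustration of the "Kafka as storage" idea discussed above: if current revision content (e.g. Wikidata entity JSON) lived in a log-compacted topic keyed by entity id, a dump would just be a full read of that topic. Everything below (topic name, brokers, the kafka-python client) is an assumption, and a real dump would stream rather than hold all entities in memory:

    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "wikidata.current-revision-content",           # hypothetical compacted topic
        bootstrap_servers=["kafka.example.org:9092"],  # hypothetical brokers
        auto_offset_reset="earliest",                  # read the topic from the start
        enable_auto_commit=False,
        consumer_timeout_ms=10_000,                    # stop once idle / caught up
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    latest = {}
    for msg in consumer:          # compaction is eventual, so keep last value per key
        latest[msg.key] = msg.value

    with open("wikidata-current-revisions.ndjson", "w") as dump:
        for value in latest.values():
            dump.write(json.dumps(value) + "\n")
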