15:31:21 <fao89> #startmeeting Pulp Triage 2020-12-01
15:31:21 <fao89> #info fao89 has joined triage
15:31:21 <fao89> !start
15:31:21 <pulpbot> Meeting started Tue Dec  1 15:31:21 2020 UTC.  The chair is fao89. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:31:21 <pulpbot> Useful Commands: #action #agreed #help #info #idea #link #topic.
15:31:21 <pulpbot> The meeting name has been set to 'pulp_triage_2020-12-01'
15:31:21 <pulpbot> fao89: fao89 has joined triage
15:31:28 <fao89> Open floor!
15:31:46 <bmbouter> #info bmbouter has joined triage
15:31:46 <bmbouter> !here
15:31:48 <pulpbot> bmbouter: bmbouter has joined triage
15:31:50 <ttereshc> #info ttereshc has joined triage
15:31:50 <ttereshc> !here
15:31:50 <pulpbot> ttereshc: ttereshc has joined triage
15:31:56 <fao89> no items on the agenda so far
15:32:08 <daviddavis> #info daviddavis has joined triage
15:32:08 <daviddavis> !here
15:32:08 <pulpbot> daviddavis: daviddavis has joined triage
15:32:31 <fao89> should we start triage?
15:32:51 <daviddavis> I think so
15:33:23 <fao89> !next
15:33:24 <fao89> #topic https://pulp.plan.io/issues/7909
15:33:24 <pulpbot> fao89: 7 issues left to triage: 7909, 7908, 7907, 7904, 7876, 7857, 5502
15:33:25 <pulpbot> RM 7909 - ipanova@redhat.com - POST - Downloader map from repair feature contains only core downloaders
15:33:26 <pulpbot> https://pulp.plan.io/issues/7909
15:33:32 <ipanova> #info ipanova has joined triage
15:33:32 <ipanova> !here
15:33:32 <pulpbot> ipanova: ipanova has joined triage
15:33:49 <fao89> #idea Proposed for #7909: accept and add to sprint
15:33:49 <fao89> !propose other accept and add to sprint
15:33:49 <pulpbot> fao89: Proposed for #7909: accept and add to sprint
15:34:00 <ipanova> +1
15:34:09 <ttereshc> +1
15:34:19 <fao89> #agreed accept and add to sprint
15:34:19 <fao89> !accept
15:34:19 <pulpbot> fao89: Current proposal accepted: accept and add to sprint
15:34:20 <fao89> #topic https://pulp.plan.io/issues/7908
15:34:20 <pulpbot> fao89: 6 issues left to triage: 7908, 7907, 7904, 7876, 7857, 5502
15:34:21 <pulpbot> RM 7908 - ggainey - NEW - Make sure all exceptions live in pulpcore.plugin.exceptions
15:34:22 <pulpbot> https://pulp.plan.io/issues/7908
15:34:33 <ggainey> #info ggainey has joined triage
15:34:33 <ggainey> !here
15:34:33 <pulpbot> ggainey: ggainey has joined triage
15:34:48 <mikedep333> #info mikedep333 has joined triage
15:34:48 <mikedep333> !here
15:34:49 <pulpbot> mikedep333: mikedep333 has joined triage
15:35:03 <fao89> story?
15:35:05 <ttereshc> why deprecation is needed?
15:35:05 <ggainey> 7908 is to clean up some misplaced exceptions - it'll require a deprecation cycle to complete cleanly
15:35:25 <ggainey> ttereshc: because we're moving an exception, anything using it where it is now would break
15:35:46 <ggainey> fao89: maybe task?
15:35:48 <ttereshc> plugins are not allowed to use anything outside plugin api
15:35:56 <ttereshc> is it moving within plugin api?
15:36:02 <fao89> #idea Proposed for #7908: convert to task
15:36:02 <fao89> !propose other convert to task
15:36:02 <pulpbot> fao89: Proposed for #7908: convert to task
15:37:05 <ggainey> ttereshc: that's a good point, I think the one exception I *know* is involved, should be in plugins but isn't currently
15:37:13 <daviddavis> I see at least one UnsupportedDigestValidationError
15:37:21 <daviddavis> it's in pulpcore.plugin.models
15:37:30 <ggainey> ah right, ok
15:37:34 <ggainey> daviddavis++
15:37:34 <pulpbot> ggainey: daviddavis's karma is now 401
15:37:34 <ttereshc> I'm all for moving it into one place
15:37:39 <ttereshc> just without deprecation
15:38:20 <ggainey> ttereshc: daviddavis put his finger on the problem
15:39:03 <ttereshc> what do you mean by that?
15:39:56 <ggainey> ttereshc: the current location is exposed to plugins in pulpcore.plugin.models, and it needs to move from there
15:40:07 <ttereshc> ah ok
15:40:16 <daviddavis> so like I'm imagining we add the exception(s) to their final place in 3.9 and then remove them from their old place in 3.10
15:40:35 <ggainey> daviddavis: yupyup, sounds right
15:40:46 <ttereshc> yeah
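(The 3.9/3.10 plan agreed here — expose the exception in its new home first, drop the old location a release later — could look roughly like the sketch below. The class and module names are illustrative stand-ins mimicking the pattern, not the actual pulpcore code.)

```python
import warnings

# New canonical home for the exception (in real pulpcore this would live in
# pulpcore.plugin.exceptions; the name here mirrors the one daviddavis cites).
class UnsupportedDigestValidationError(Exception):
    """Raised when an unsupported checksum type is requested."""


def _deprecated_alias(new_cls, old_location):
    """Return a subclass that warns when instantiated via the old import path."""
    class _Alias(new_cls):
        def __init__(self, *args, **kwargs):
            warnings.warn(
                f"{new_cls.__name__} has moved out of {old_location}; "
                "import it from its new location instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            super().__init__(*args, **kwargs)
    _Alias.__name__ = new_cls.__name__
    return _Alias


# What the old module (pulpcore.plugin.models in this discussion) would
# re-export during the 3.9 deprecation window, to be deleted in 3.10:
OldUnsupportedDigestValidationError = _deprecated_alias(
    UnsupportedDigestValidationError, "pulpcore.plugin.models"
)
```

Because the alias subclasses the new exception, `isinstance` checks against the new class keep working for plugins still raising the old name during the transition.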
15:40:48 <daviddavis> would you be able to do this this week? https://pulp.plan.io/issues/7908
15:41:04 <ggainey> daviddavis: yup, it's on my list for 'asap'
15:41:11 <daviddavis> ggainey++
15:41:11 <pulpbot> daviddavis: ggainey's karma is now 64
15:41:16 <ggainey> precisely so it can get into 3.9
15:41:26 <daviddavis> !propose convert to task and add to sprint
15:41:26 <pulpbot> daviddavis: Error: "propose" is not a valid command.
15:41:32 <daviddavis> I'll add the 3.9 milestone
15:41:36 <ggainey> thanks
15:41:56 <bmbouter> why not as the pr introducing the new exception?
15:41:57 <fao89> #idea Proposed for #7908: convert to task and add to sprint
15:41:57 <fao89> !propose other convert to task and add to sprint
15:41:57 <pulpbot> fao89: Proposed for #7908: convert to task and add to sprint
15:42:20 <ggainey> concur
15:42:22 <bmbouter> or maybe this is to move the other ones?
15:42:43 <ipanova> bmbouter: the other ones as well
15:42:43 <bmbouter> or maybe the value of this other issue is to have a plugin writer release note introducing it?
15:42:58 <bmbouter> that makes sense, ty
15:43:35 <fao89> #agreed convert to task and add to sprint
15:43:35 <fao89> !accept
15:43:35 <pulpbot> fao89: Current proposal accepted: convert to task and add to sprint
15:43:36 <fao89> #topic https://pulp.plan.io/issues/7907
15:43:37 <pulpbot> fao89: 5 issues left to triage: 7907, 7904, 7876, 7857, 5502
15:43:38 <pulpbot> RM 7907 - osapryki - NEW - Failed curate_synclist_repository task did not clean up properly resource reservations
15:43:39 <pulpbot> https://pulp.plan.io/issues/7907
15:44:26 <bmbouter> oh yeah I got contacted on slack this morning about this one
15:44:43 <ggainey> ow, painful
15:44:56 <ggainey> (the issue, not necessarily slack-contact :) )
15:44:59 <fao89> accept or accept and add?
15:45:42 <daviddavis> I lean towards accept and add. maybe even high severity?
15:45:49 <daviddavis> this puts the system in an unusable state
15:45:53 <bmbouter> agreed, but the issue is what to dooooo
15:46:04 <fao89> #idea Proposed for #7907: accept and add to sprint
15:46:04 <fao89> !propose other accept and add to sprint
15:46:04 <pulpbot> fao89: Proposed for #7907: accept and add to sprint
15:46:05 <dkliban> yeah
15:46:07 <bmbouter> we know that redis queues are unstable
15:46:08 <dkliban> #info dkliban has joined triage
15:46:08 <dkliban> !here
15:46:08 <pulpbot> dkliban: dkliban has joined triage
15:46:18 <bmbouter> this is not new information
15:46:34 <ipanova> apparently this happens only when there is 1 worker
15:46:45 <ttereshc> any cleanup we can do on task failure? or does it happen at another point?
15:46:55 <bmbouter> we know that RQ doesn't support clustered redis to guard against redis failure
15:47:03 <fao89> should we make workers=2 as default?
15:47:05 <bmbouter> the issue is the task has not failed...
15:47:13 <bmbouter> worker count is not part of the issue I believe
15:47:20 <ttereshc> I guess the full blockage is easier to reproduce with one worker
15:47:29 <ttereshc> but eventually all can be blocked probably
15:47:31 <bmbouter> true
15:48:07 <bmbouter> so here's the issue ... redis fails and the task is lost from redis, the worker stays running and heartbeating so pulp can never know that there was a problem
15:48:13 <bmbouter> the heartbeats flow to postgresql
15:48:48 <bmbouter> this is one main motivation why I think we should eliminate the resource manager and move to tasking recorded entirely in postgresql
15:49:11 <bmbouter> because our correctness is "split" across two systems, and note RQ doesn't support clustered redis so redis is a single point of failure
15:49:29 <bmbouter> so one of those systems is unreliable and that is beyond our control (or influence)
15:49:50 <dkliban> makes sense
15:50:00 <bmbouter> fwiw I think this is a dupe of https://pulp.plan.io/issues/7386
15:50:42 <bmbouter> so this is triage and we can move on but the thing is we can't just "put it on the sprint without more action"
15:50:57 <ttereshc> can we periodically check with redis if the task is there and act accordingly? It's not the solution but maybe a less invasive workaround while we have a resource_manager.
15:51:46 <bmbouter> we could have the resource manager probably do checking
15:51:57 <bmbouter> but then we'll get another bug I suspect "pulp mysteriously cancels my tasks"
15:52:23 <ttereshc> heh, yeah
15:52:33 <bmbouter> the args and kwargs live in the redis task alone so we can't redispatch
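(The less-invasive workaround ttereshc floats — have something periodically verify that every task Pulp believes is running still has a backing job in Redis, and mark it failed if not — might be sketched like this. `Task`, `mark_failed`, and the in-memory `redis_jobs` set are illustrative stand-ins, not real Pulp or RQ APIs.)

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Stand-in for Pulp's Task record in postgresql."""
    job_id: str
    state: str = "running"
    error: str = ""

    def mark_failed(self, reason: str) -> None:
        self.state = "failed"
        self.error = reason


def reap_lost_tasks(running_tasks, redis_jobs):
    """Fail any 'running' task whose backing RQ job vanished from Redis.

    With real RQ the membership test would be something like
    rq.job.Job.exists(job_id) (or Job.fetch raising NoSuchJobError);
    here redis_jobs is a plain set standing in for the Redis keyspace.
    """
    lost = []
    for task in running_tasks:
        if task.state == "running" and task.job_id not in redis_jobs:
            task.mark_failed("backing RQ job lost from Redis")
            lost.append(task)
    return lost
```

As bmbouter notes, the args and kwargs live only in the Redis job, so this can detect and fail a lost task but cannot redispatch it — which is why the conversation turns toward eliminating Redis from the architecture instead.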
15:52:37 <ipanova> let's just triage it for now, and have more discussion on the issue?
15:52:54 <ggainey> sounds fair
15:52:54 <bmbouter> the problem is we're getting "stern talkings to" about this
15:53:05 <dalley> bmbouter, an alternative is to have the heartbeats recorded in redis
15:53:25 <bmbouter> that's possible, but I think that further splits our system though
15:53:40 <dalley> indeed
15:53:51 <bmbouter> I can schedule a 30 min call to discuss if folks would want to join
15:54:00 <bmbouter> or discuss on the issue
15:54:21 <bmbouter> the only viable road I see is resource manager elimination and removal of redis from the architecture
15:54:30 <dalley> I don't have any objection to a queue design based on postgresql, we don't have high throughput requirements (where high is measured in terms of tasks/second)
15:54:31 <bmbouter> I can sufficiently motivate this on the issue I guess would be a good next step
15:55:00 <bmbouter> let me post on the issue and send out to pulp-dev for feedback, how does that sound?
15:55:06 <ggainey> +1
15:55:10 <ttereshc> I'm also not against removing the resource manager
15:55:16 <ttereshc> +1 thank you bmbouter
15:55:22 <daviddavis> +1
15:55:31 <ttereshc> I was more thinking what can be done if we need to fix it asap
15:55:31 <bmbouter> yeah I perceive that, the proposal of a less invasive workaround is a good one
15:55:53 <ipanova> +1 to also have a call to talk about the resource manager elimination
15:56:06 <bmbouter> I will include the workaround in my post also, perhaps we'll do both
15:56:20 <bmbouter> and schedule a call following pulp-dev posting
15:56:31 <ipanova> ty
15:56:39 <fao89> which action should I take? accept? skip? accept and add?
15:56:48 <bmbouter> accept w/ high prio and I'll post on it
15:57:05 <dalley> minorly related to https://pulp.plan.io/issues/4343 also perhaps
15:57:18 <fao89> #idea Proposed for #7907: accept with high prio
15:57:18 <fao89> !propose other accept with high prio
15:57:18 <pulpbot> fao89: Proposed for #7907: accept with high prio
15:57:25 <fao89> #agreed accept with high prio
15:57:25 <fao89> !accept
15:57:25 <pulpbot> fao89: Current proposal accepted: accept with high prio
15:57:26 <fao89> #topic https://pulp.plan.io/issues/7904
15:57:27 <pulpbot> fao89: 4 issues left to triage: 7904, 7876, 7857, 5502
15:57:28 <pulpbot> RM 7904 - ggainey - NEW - PulpImport can deadlock when importing Centos*-base and app-stream in one import file
15:57:29 <pulpbot> https://pulp.plan.io/issues/7904
15:57:47 <daviddavis> accept and add to sprint I think
15:57:51 <ggainey> I hit this yesterday - opened the issue to record the scenario and failure-output
15:57:58 <ggainey> aye
15:58:12 <fao89> #idea Proposed for #7904: accept and add to sprint
15:58:12 <fao89> !propose other accept and add to sprint
15:58:12 <pulpbot> fao89: Proposed for #7904: accept and add to sprint
15:58:24 <fao89> #agreed accept and add to sprint
15:58:24 <fao89> !accept
15:58:24 <pulpbot> fao89: Current proposal accepted: accept and add to sprint
15:58:25 <fao89> #topic https://pulp.plan.io/issues/7876
15:58:25 <pulpbot> fao89: 3 issues left to triage: 7876, 7857, 5502
15:58:26 <pulpbot> RM 7876 - adam.winberg@smhi.se - NEW - NoneType' object has no attribute 'pk'
15:58:27 <pulpbot> https://pulp.plan.io/issues/7876
15:59:11 <ttereshc> we discussed this one last week at pulpcore meeting and agreed to fix in pulpcore
15:59:33 <ttereshc> the issue can be reproduced using the migration plugin
15:59:34 <fao89> #idea Proposed for #7876: Leave the issue as-is, accepting its current state.
15:59:34 <fao89> !propose accept
15:59:34 <pulpbot> fao89: Proposed for #7876: Leave the issue as-is, accepting its current state.
15:59:43 <dkliban> +1
15:59:47 <ggainey> +1
15:59:56 <dkliban> ttereshc: add to sprint?
16:00:49 <ipanova> +1 not against adding to sprint
16:00:59 <ttereshc> can be
16:01:04 <ipanova> the fix should be small and easy if i remember correctly
16:01:16 <ttereshc> it's not the highest priority but needs to be fixed
16:01:22 <fao89> #idea Proposed for #7876: accept and add to sprint
16:01:22 <fao89> !propose other accept and add to sprint
16:01:22 <pulpbot> fao89: Proposed for #7876: accept and add to sprint
16:01:26 <fao89> #agreed accept and add to sprint
16:01:26 <fao89> !accept
16:01:27 <pulpbot> fao89: Current proposal accepted: accept and add to sprint
16:01:27 <fao89> #topic https://pulp.plan.io/issues/7857
16:01:28 <pulpbot> fao89: 2 issues left to triage: 7857, 5502
16:01:29 <pulpbot> RM 7857 - jsherril@redhat.com - NEW - 502 proxy error when trying to yum install a package
16:01:30 <pulpbot> https://pulp.plan.io/issues/7857
16:03:14 <ggainey> not a lot to go on in the issue
16:03:37 <fao89> we skipped it last time, but I don't remember if we had an AI
16:04:57 <fao89> skip again?
16:06:03 <dkliban> could this be related to the django cleanup integration we had?
16:06:06 <ttereshc> we can close it asking to provide more info to reproduce if someone encounters it again
16:06:20 <ttereshc> dkliban, it was introduced in 3.7
16:06:23 <dkliban> gotcha
16:06:24 <ttereshc> we ruled it out last time
16:06:45 <dkliban> yeah ... let's close it out for now
16:07:04 <ttereshc> fao89, I can do it if you want
16:07:20 <fao89> please, do it
16:07:24 <ipanova> i found in the logs that we meant to ping jsherrill and ask more info about setup
16:07:49 <ttereshc> we can close and still do it I think
16:07:50 <fao89> #idea Proposed for #7857: close and ask for more info
16:07:50 <fao89> !propose other close and ask for more info
16:07:50 <pulpbot> fao89: Proposed for #7857: close and ask for more info
16:07:59 <ipanova> ok
16:08:26 <fao89> #agreed close and ask for more info
16:08:26 <fao89> !accept
16:08:26 <pulpbot> fao89: Current proposal accepted: close and ask for more info
16:08:27 <pulpbot> fao89: 1 issues left to triage: 5502
16:08:27 <fao89> #topic https://pulp.plan.io/issues/5502
16:08:28 <pulpbot> RM 5502 - mihai.ibanescu@gmail.com - ASSIGNED - worker fails to mark a task as failed when critical conditions are encountered
16:08:29 <pulpbot> https://pulp.plan.io/issues/5502
16:09:10 <fao89> bmbouter: ^
16:09:17 <ipanova> this already has pr
16:09:50 <ipanova> just need to update the commit message
16:09:56 <ipanova> let's triage and put on the sprint
16:10:08 * bmbouter looks
16:10:23 <fao89> #idea Proposed for #5502: accept and add to sprint
16:10:23 <fao89> !propose other accept and add to sprint
16:10:23 <pulpbot> fao89: Proposed for #5502: accept and add to sprint
16:10:29 <bmbouter> yes this PR is open
16:10:52 <ttereshc> +1
16:10:53 <fao89> #agreed accept and add to sprint
16:10:53 <fao89> !accept
16:10:53 <pulpbot> fao89: Current proposal accepted: accept and add to sprint
16:10:54 <pulpbot> fao89: No issues to triage.
16:10:54 <bmbouter> I need to add a changelog to it
16:10:57 <bmbouter> +1
16:11:11 <fao89> #endmeeting