Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add movingaverage function. #13923

Closed
wants to merge 4 commits into from
Closed

ENH: Add movingaverage function. #13923

wants to merge 4 commits into from

Conversation

NGeorgescu
Copy link

@NGeorgescu NGeorgescu commented Jul 6, 2019

Added a moving average function, which is in keeping with np.mean and np.sum functions which operate on arrays. Also has the 'axis' kwarg option. Works for arrays of any size/depth and is intuitive in syntax.

@NGeorgescu
Copy link
Author

Tried looking into the details of failed checks. Not sure why 'shippable' failed or what that means. Any help with the error checks is appreciated, along with any help regarding syntactic standards in keeping with the repo standards.

Thanks.

@mattip
Copy link
Member

mattip commented Jul 7, 2019

Isn't this the same as a convolution with [1, 1, 1] / 3 (with some decisions needed about offsets?

In general, enhancements and new APIs need to hit the mailing list for discussion. We are very conservative about adding new functionality. After there is consensus, this needs tests for success and for failing edge cases.

@NGeorgescu
Copy link
Author

NGeorgescu commented Jul 8, 2019

Isn't this the same as a convolution with [1, 1, 1] / 3 (with some decisions needed about offsets?

Not quite, because np.convolve only accepts 1D arrays. So in order to get np.convolve to behave in this way, you need to use a list comprehension. And if you then want axis=x, you need to then turn it into an array, and do the swapaxes, and then convolve for each i (with one comprehension at each depth) and then swapaxes again. At that point, you might as well just copy this solution and implement it yourself.

In general, enhancements and new APIs need to hit the mailing list for discussion. We are very conservative about adding new functionality.

100% understand. Thought I would write the code. The magic line with the np.sum in it is non-obvious. Furthermore there seems to be a lot of people trying to figure out moving averages in numpy/pandas from my googling, everyone has a different solution, and I think I'm the only one with one successful to an arbitrary depth. I also considered publishing this as a single-def package but I want to avoid npm-style implosions.

After there is consensus, this needs tests for success and for failing edge cases.

So the way that this algorithm works is that, assuming you have [a,b,c,d,e...z] for a moving average of n=3, it creates one array [a,b,c, ... ,w,x], one [b,c,d,...., x,y] and one [c,d,e,.... ,y,z] (3 in total) and then sums them together and divides by n. So the first element is (a+b+c)/3, the second (b+c+d)/3, ... the last is (x, y, z)/3. So this works to any depth since a itself can be an array. The only real edge cases are ragged arrays or any other time np.sum would fail that I can think of. But I'm interested to see what others think.

That said thanks for your help and consideration.

@eric-wieser
Copy link
Member

eric-wieser commented Jul 8, 2019

If we're going to add something like this to numpy, I think it should be spelt something like np.average(np.sliding_window(x, len=l. axis=a), axis=a).

We already have numerous requests for sliding_window, and just need to pin down its semantics.

@NGeorgescu
Copy link
Author

NGeorgescu commented Jul 8, 2019

If we're going to add something like this to numpy, I think it should be spelt something like np.average(np.sliding_window(x, len=l. axis=a), axis=a).
We already have numerous requests for sliding_window, and just need to pin down its semantics.

In the wolfram language, for instance, MovingAverage[] is it's own separate function (but I numpified it, like np.std(), np.cumsum(), etc.). This seems the most logical because it will be always have a specified axis on the np.average() function, so the average is then a wrapper that requires this clunky kwarg, but I understand not wanting to feature creep, and I agree though sliding_window itself seems like a very useful function (for instance, in determining a maximum value in some moving window, etc.)

It seems a bit clunky to require a kwarg for sliding_window, since it makes no sense without it. It makes sense for like axis=0, for instance, since that's a reasonable default value, but what is the default len if the kwarg isn't included? len=1 and a len the length of the array are both trivial nesting operations. Seems like that's better as an arg than kwarg. I mean, unless there's a compelling counter-example. This works:

def sliding_window(a,n,axis=0):
    a = asarray(a).swapaxes(0,axis)
    a = array([a[m:len(a)-n+1+m] for m in range(n)])
    return a

Wherein sliding_window is a function that takes in an n-dimensional array are returns an (n+1)-dimensional array wherein each member of the axis dimension is a sliding window of the original array along the chosen axis. Then np.average(np.sliding_window(a, n. axis=x), axis=x) will work assuming that both x numbers are the same. (return a.swapaxes(axis,0) in sliding_window if you want the np.average to have a kwarg permanently of axis=0 rather than axis=x).

@eric-wieser
Copy link
Member

Let's not discuss sliding window here. #7753 has some good discussion, and is probably pending a maintainer / mailing list decision. Note that the proposed sliding window does so without making any data copies, making it faster than your approach.

@charris charris changed the title Movingaverage ENH: Add movingaverage function. Jul 8, 2019
@NGeorgescu
Copy link
Author

NGeorgescu commented Jul 9, 2019

There's a "16 months faster" joke in there somewhere considering the last code was in March of last year. :P

@NGeorgescu
Copy link
Author

NGeorgescu commented Jul 9, 2019

Great, then using the solution

def sliding_window_view(arr, shape):
    n = np.array(arr.shape) 
    o = n - shape + 1 # output shape
    strides = arr.strides
    
    new_shape = np.concatenate((o, shape), axis=0)
    new_strides = np.concatenate((strides, strides), axis=0)
    return np.lib.stride_tricks.as_strided(arr ,new_shape, new_strides)

Then the code should be

def mvgavg(a, n, axis=0):
    shape = a.shape-(n-1)*np.eye(a.ndim,dtype='int')[axis]
    window = sliding_window_view(a, shape)
    return np.mean(window,axis=axis).reshape(shape)

If you want to implement a moving average, I'd say that second block I came up with is non-trivial. I really think it should be it's own function then.

@NGeorgescu
Copy link
Author

Timed it, looks like it's about 2-3x faster than my original algorithm for np.random.rand(30000,2000)

Seriously, though, considering that there is still no movement, should we start a new library or like PR someone else's? Why is this so behind?

@seberg
Copy link
Member

seberg commented Jul 9, 2019

One thing about moving averages is that if you do not mind losing a bit of precision, there are much faster solution out there. The same is true for many other "moving" functions. Which does not mean that we should not add that windowing function, as long as the documentation mentions that maybe.

@NGeorgescu
Copy link
Author

Is there a faster algorithm that has the same functionality? What do you mean by a “losing bit of precision”? Binning?

@seberg
Copy link
Member

seberg commented Jul 9, 2019

@NGeorgescu
Copy link
Author

NGeorgescu commented Jul 9, 2019

Well that’s an np.cumsum Difference algorithm. It would be the same precision unless I’m misunderstanding what you’re trying to say. I’d be interested to see if it’s faster than using the reindexing by @fnjn. I know people use it for convolutional neural networks and the such because of its speed but I don’t know if it still applies when taking a 1D moving average. The reason people use it is to avoid repeat operations, but I don’t think that advantage applies here.

@NGeorgescu
Copy link
Author

NGeorgescu commented Jul 9, 2019

Well I implemented a cumsum lookup algorithm and it looks like it actually wins for very wide (not along the averaging axis) 2D arrays and for very long averaging schemes! Very nice @seberg!

It's also more understandable than the sliding window. Try it yourself:

Looks like we have a new contender for fastest algorithm, @eric-wieser .


import numpy as np

def sliding_window_view(arr, shape):
    """
    This function was written by fnjn on github and posted to numpy/numpy.
    It takes in an array and a shape, and then gives back an array that is
    essentially moving around a planchette of the described shape over the
    array.
    """
    n = np.array(arr.shape) 
    o = n - shape + 1 # output shape
    strides = arr.strides
    new_shape = np.concatenate((o, shape), axis=0)
    new_strides = np.concatenate((strides, strides), axis=0)
    return np.lib.stride_tricks.as_strided(arr ,new_shape, new_strides)

def mvgavg(a, n, axis=0, weights=None):
    a = np.array(a)
    shape = a.shape-(n-1)*np.eye(a.ndim,dtype='int')[axis]
    window = sliding_window_view(a, shape)
    return np.mean(window,axis=axis).reshape(shape)

def cumsum_mvgavg(a, n, axis=0):
    table = np.cumsum(a.swapaxes(0,axis)
    ,axis=0)/(n)
    try:
        table = np.vstack([[np.zeros(len(table[0]))],table])
    except:
        table = np.array([0,*table])
    return np.swapaxes(table[n:]-table[:len(table)-n],axis,0)


arr = np.random.rand(3000,10)
import time
times_cumsum=[]
times_reindex=[]
for i in range(50):
    t_cumsum = time.time()
    ca = cumsum_mvgavg(arr,50,axis=0)
    t_cumsum = time.time()-t_cumsum

    times_cumsum.append(t_cumsum)
    
    t_reindex= time.time()
    ra = mvgavg(arr,50,axis=0)
    t_reindex = time.time()-t_reindex
    
    times_reindex.append(t_reindex)
    assert np.mean((ca-ra)**2)<10**-6

times_cumsum = np.mean(times_cumsum[2:-2])
times_reindex = np.mean(times_reindex[2:-2])
print([times_cumsum,times_reindex])

@NGeorgescu
Copy link
Author

NGeorgescu commented Jul 9, 2019

For some reason my cumsum_mvgavg doesn't work beyond axis depth 2. Weird. I'll have to figure that out. Edit: fixed.

So it turns out it's a bit more complicated, sometimes cumsum wins, sometimes reindex wins. Cumsum seems to do well when you have a very wide 2D array and a long averaging length. Otherwise, reindex seems to win. But when cumsum wins it wins by a lot. Here's some array shapes, avg length, and times for cumsum and reindex:

(3000,) 50 [0.0006575 0.0002985]
(3000,) 5 [0.000719  0.0001683]
(3000, 10) 50 [0.0004457 0.0016589]
(3000, 10) 5 [0.0005152 0.0005925]
(3000, 10, 20) 50 [0.0110596 0.1357119]
(3000, 10, 20) 5 [0.0116897 0.0194189]

using


def cumsum_mvgavg(a, n, axis=0):
    table = np.cumsum(a.swapaxes(0,axis),axis=0)/n
    try:
        table = np.vstack([[0*table[0]],table])
    except:
        table = np.array([0,*table])
    return np.swapaxes(table[n:]-table[:-n],0,axis)

@seberg
Copy link
Member

seberg commented Jul 12, 2019

@NGeorgescu before you spend even more time on this, I want to make sure that you understand that new functions will have to be passed by the mailing list and that we may have different long term goals, etc. Saying this, I hope that even if this ends up being closed, you are happy to have learned and optimized your own code.

@NGeorgescu
Copy link
Author

Well I mean how long is this email list process? Weeks? Months? ... looks like years from the other PR.

I made a single-function library out of it. If something moves let me know. I updated the PR with the updated code, which includes the fastest algorithm, the sliding window for weighting, as well as a binning function which is orders of magnitude faster but shortens the data. See you at python-moving-averages.

@seberg
Copy link
Member

seberg commented Jul 23, 2019

Sorry, forgot about it, the email process could be quick, if there is a consensus about the name and if it should go in. A library for moving window functions would be a nice solution also. Thank you for starting with an external small package, can help others without the trickier decisions.

Maybe you can just try and send an email to the mailing list assuming you want to push it a bit.

@NGeorgescu
Copy link
Author

Hey @seberg I put in a topic at the nabble boards. It hasn't shown up yet (not sure what the approval process is like).

@seberg
Copy link
Member

seberg commented Aug 1, 2019

@NGeorgescu not sure what is going on there. But we moved the mailing list to be hosted at python, so I think you probably got the wrong place and likely have to sign up for numpy-discussions@python.org.

@NGeorgescu
Copy link
Author

NGeorgescu commented Aug 1, 2019

Oh I got to that through the Read/Search numpy-discussion on this page. How do I post to numpy-discussions?

@seberg
Copy link
Member

seberg commented Aug 1, 2019

Seems that is still up to date, although I am not quite sure you can interact through it, maybe it will show up in a bit. (I once had my emails eaten by python.org, because of a two sided thing. My mailserver did something slightly funny, and python.org did not quite adhere to the standard)

@NGeorgescu
Copy link
Author

NGeorgescu commented Aug 6, 2019

Hey @seberg , it seems that my posts are being eaten by the system. Would you mind posting the comment? It seems you've had successful recent posts there.

ENH: Add movingaverage function.

I opened a Pull Request #13923 to
include this package https://pypi.org/project/mvgavg/ in numpy, along
with the associated sliding window function in this PR
#7753.

Thanks.

@rgommers
Copy link
Member

This PR is unfortunately not going to work. There seems to be a consensus in gh-7753 that this functionality should not be in the main numpy namespace; either we put it in numpy.lib.stride_tricks or leave it out completely (given that scikit-image has this already).

I'd probably trim this PR down a lot and try to get sliding_window in first. There's way too many functions in here to start reviewing in detail.

@NGeorgescu
Copy link
Author

Well sliding window hasn't been moved on in months. Should I put that on the mailing list first or how should I proceed?

@fanjin-z
Copy link

fanjin-z commented Aug 26, 2019

@NGeorgescu I updated the slidingwindow about a month ago according to reviewers suggestions.
No afterwards feedback since then.

@rgommers
Copy link
Member

If you want to move this PR forward, I suggest to put it on top of the sliding_window one, and remove everything except the function you want to add plus tests for it. There's lots of code in here that's not needed.

@NGeorgescu
Copy link
Author

I did that. pip install mvgavg

@rgommers
Copy link
Member

okay, I'll be more explicit in pointing out what needs to go

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, a very quick first pass over the code - there's quite a bit to change. Also, this needs tests.

new_strides = np.concatenate((strides, strides), axis=0)
return np.lib.stride_tricks.as_strided(arr ,new_shape, new_strides)

def pascal(n):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't look like something we'd want to include

"""returns the nth line of a pascal's triangle"""
return np.array([1]) if n == 1 else np.array([0,*pascal(n-1)])+np.array([*pascal(n-1),0])

def triangle(n):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't look like something we'd want to include

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An arguement for keeping this somewhere - it somewhat fits in with the other window functions - although if we decide to add to that list, I'd be inclined to create an np.window sub-module, rather than growing the top namespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those window functions are considered a mistake, they shouldn't have been added in the first place. Certainly we don't want to add to them. scipy.signal is a much better place for those.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like we need to update that doc page to recommend the scipy versions then (and the larger library of functions alongside them)

def triangle(n):
return np.array([*range(1,int((n+2)/2))]+[*range(int((n+1)/2),0,-1)])

def quadratic(n):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't look like something we'd want to include

Width of the moving average
axis: int
Axis along which the moving average is to take place, 0 if unspecified
weights: 'pascal', 'triangle', 'quadratic', or a list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weights need to be generic, not just a couple of random weighting scheme. probably should take a callable of some sort. since the shape of weights depends on the window, it can't just be array_like I think (which would otherwise be preferred, as in np.average).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I grabbed the movingaverage weighting scheme from another language. I'll get rid of the random weights. I agree that's not clean.



#@array_function_dispatch(_mvgavg_dispatcher)
def mvgavg(a, n, axis=0, weights=False, binning=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a readable name: moving_average

If you have any doubts of how a moving average works, try it out with an array of
symbols in sympy.

axis
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and the next one aren't valid section names, please remove the headers

Binning vs Unbinned
--------------------
Consider the array [a b c d e... ] that is being averaged to the product array,
[A B C D... ] over a distance of 3. If binning = False:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:: for proper rendering at the end of this line

Examples
--------
>>> from mvgavg import mvgavg
>>>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spurious >>>, remove

--------
>>> from mvgavg import mvgavg
>>>
>>> example_array=[[i,(-1)**i] for i in np.arange(5)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just call it x. and PEP8 spaces please


"""
if axis == None:
axis = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

axis=None usually means to flatten the array. if you want the default to be axis=0, put it in the signature

@NGeorgescu
Copy link
Author

Thanks! I appreciate you taking the time on this. I'll get to changing. Also, what are the tests and how can they be incorporated? Like what are we testing for? Is it something like moving_average((-1)**np.arange(100),2) should yield the same answer as 0.*np.arange(99) or was there something more specific?

@rgommers
Copy link
Member

what are the tests and how can they be incorporated? Like what are we testing for?

Have a look at the tests for average for comparison: https://github.com/numpy/numpy/blob/master/numpy/lib/tests/test_function_base.py#L282

@eric-wieser
Copy link
Member

I wonder if we should just go down the road of providing a recipe in the docs, rather than actually exposing this function?

Something as simple as:

For example, this function can be used to build a moving average

..code:: python

    def moving_average(a, n, weights=None, axis=-1):
        axis = normalize_axis_index(axis, a.ndim)  # needed to make the + below valid
        return np.average(sliding_window_view(a, (n,), axis=(axis,)), weights, axis=a.ndim+axis)

@NGeorgescu
Copy link
Author

Two reasons:

  1. The unweighted moving average using a cumsum operation is generally faster, esp. for large n, large numbers of axes, and for arrays that are very wide with respect to the averaging axis.
  2. Most other languages have a single-function call, not a multi-line function, e.g. rollmean in R, MovingAverage in Wolfram, etc.

@Qiyu8
Copy link
Member

Qiyu8 commented May 29, 2020

Looks like a controversial PR, @NGeorgescu , you need to put more energy on this if you want to move on, or I will close it for PR management.

@NGeorgescu
Copy link
Author

Yeah so I was thinking about this PR I opened. In hindsight the weights was a feature that no one will use (I think), and if you do well I guess you can custom code your own thing. This simplifies the PR immensely and no longer involves the stride tricks so I shouldn't pull that I should just pull whatever normal dev branch (I forget).

So what's easiest: should I just change this PR or just open a fresh one?

@Qiyu8
Copy link
Member

Qiyu8 commented Jun 1, 2020

It's better to create a new one.

@NGeorgescu
Copy link
Author

Cool Will Do!

@NGeorgescu NGeorgescu closed this Jun 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants