A couple of questions - Printable Version

A couple of questions - Led_Zeppelin - Sep-29-2022

I have a few questions about my next step in the analysis. I have attached the page in question, which moves the label data 10 steps ahead. Since the data occurs
once every minute for 220320 minutes, certain changes obviously must be made to the original DataFrame. This method seems to me a long way around and
a little confusing.

The first part is the easiest:

IndexData_org = pd.DataFrame(columns=['Index', 'sensor_01', 'sensor_02', 'sensor_03', 'sensor_04', 'sensor_06', 'sensor_10', 'sensor_11', 'sensor_12', 'sensor_38', 'sensor_40', 'machine_status'])

Now I am aware that there is an easier way to do what the above Python code does. I have used it many times. The next set of Python lines is what I am confused about.

IndexData_org['Index'] = range(indexData.shape[0])
IndexData_org['sensor_01'] = indexData['sensor_01']
IndexData_org['sensor_02'] = indexData['sensor_02']
IndexData_org['sensor_03'] = indexData['sensor_03']
IndexData_org['sensor_04'] = indexData['sensor_04']
IndexData_org['sensor_06'] = indexData['sensor_06']
IndexData_org['sensor_10'] = indexData['sensor_10']
IndexData_org['sensor_11'] = indexData['sensor_11']
IndexData_org['sensor_12'] = indexData['sensor_12']
IndexData_org['sensor_38'] = indexData['sensor_38']
IndexData_org['sensor_40'] = indexData['sensor_40']

Now I know there is a better way to do this, but I just state this because it leads into the next section which I do have questions on.

for i in tqdm(range(indexData.shape[0] - 10)):
    IndexData_org['machine_status'][i] = indexData['machine_status'][i + 10]

Now I believe that this is correct, but it just seems a long way around to remove the first ten data points from the DataFrame. This is advancing the data by 10 minutes in order to make a prediction 10 minutes in advance.

There must surely be an easier way than going through all of the data points.

Anyway, that is my first question. Any help appreciated. There just must be a simpler way.

Please note that I have attached the pdf file which shows all of this.

Respectfully,

LZ

RE: A couple of questions - deanhystad - Sep-30-2022

Is the purpose of this to insert "Index" as column 0 of the dataframe?

IndexData_org = pd.DataFrame(columns=['Index', 'sensor_01', 'sensor_02', 'sensor_03', 'sensor_04', 'sensor_06', 'sensor_10', 'sensor_11', 'sensor_12', 'sensor_38', 'sensor_40', 'machine_status'])
IndexData_org['Index'] = range(indexData.shape[0])
IndexData_org['sensor_01'] = indexData['sensor_01']
IndexData_org['sensor_02'] = indexData['sensor_02']
...

To insert a column, use insert(). You could also create a new dataframe with the index column and join it with the original dataframe.

import pandas as pd
# Make some dummy data
df = pd.DataFrame({i: [j * i for j in range(5)] for i in range(1, 5)})
# Reorganize columns and insert index column at 0
df2 = df[[1, 3, 2]]
df2.insert(0, "Index", range(len(df)))
print("Using insert")
print(df2)
# Reorganize columns and join with index dataframe
df3 = pd.DataFrame(range(len(df)), columns=["Index"]).join(df[[1, 3, 2]])
print("\nUsing join")
print(df3)

Output:
Using insert
   Index  1   3  2
0      0  0   0  0
1      1  1   3  2
2      2  2   6  4
3      3  3   9  6
4      4  4  12  8

Using join
   Index  1   3  2
0      0  0   0  0
1      1  1   3  2
2      2  2   6  4
3      3  3   9  6
4      4  4  12  8

For the second part of the question, you could use slices or drop().

import pandas as pd
# Make some dummy data
df = pd.DataFrame({i: [j * i for j in range(5)] for i in range(1, 5)})
print("Original")
print(df)
df2 = df[2:].reset_index(drop=True)
print("\nGet slice that skips first two rows")
print(df2)
df3 = df.drop(range(2)).reset_index(drop=True)
print("\nDrop first two rows")
print(df3)

Output:
Original
   1  2   3   4
0  0  0   0   0
1  1  2   3   4
2  2  4   6   8
3  3  6   9  12
4  4  8  12  16

Get slice that skips first two rows
   1  2   3   4
0  2  4   6   8
1  3  6   9  12
2  4  8  12  16

Drop first two rows
   1  2   3   4
0  2  4   6   8
1  3  6   9  12
2  4  8  12  16
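
The loop in the question copies machine_status from row i + 10 into row i. A minimal sketch of the same step without a loop, using Series.shift(-10) on made-up data, might look like the following. The names df, shifted and HORIZON are placeholders, not names from the original post, and dropping the trailing rows is an assumption about what should happen to rows that have no label 10 minutes ahead.

```python
import pandas as pd

HORIZON = 10  # how many rows (minutes) ahead the label should come from

# Dummy stand-in for the real sensor data (hypothetical values)
df = pd.DataFrame({
    "sensor_01": range(100, 130),
    "machine_status": range(30),
})

shifted = df.copy()
# Row i now holds the status from row i + HORIZON;
# the last HORIZON rows have no future value and become NaN
shifted["machine_status"] = df["machine_status"].shift(-HORIZON)

# Assumption: rows without a future label are not wanted, so drop them
shifted = shifted.dropna(subset=["machine_status"]).reset_index(drop=True)

# Same "Index" column as in the question, inserted at position 0
shifted.insert(0, "Index", range(len(shifted)))

print(shifted.head())
```

Without the dropna() line the result should match what the loop produces, where the last 10 rows of machine_status are left empty.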

Are you calculating the rolling mean as described in your attached PDF? You should use the rolling() function instead. I mentioned using rolling() in a reply to another of your posts: https://python-forum.io/thread-38093-post-161222.html#pid161222

After calling rolling() you would remove the NaNs at the start of the averaged data and reset the index. That is one or two lines of Python code and sooooo much faster than what you are doing.
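
As a rough sketch of those one or two lines, assuming a window of 10 rows and dummy data in place of the sensor columns (WINDOW and averaged are names made up for this example):

```python
import pandas as pd

WINDOW = 10  # assumed window size, in rows

# Dummy data: 4 columns, 20 rows
df = pd.DataFrame({i: [j * i for j in range(20)] for i in range(1, 5)})

# Rolling mean, then drop the leading NaN rows and renumber from 0
averaged = df.rolling(WINDOW).mean().dropna().reset_index(drop=True)

print(averaged.head())
```

Note that dropna() removes any row containing a NaN, so if the real data has gaps of its own, slicing off the first WINDOW - 1 rows (as in df.rolling(10).mean()[9:]) may be the safer choice.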

As much as you are using pandas you should really take a class. I would not take on this big a project without knowing my tools top to bottom and inside out. You really don't have all that much data. 220,000 rows is puny for pandas. I think most of your 40 minute processing time is from doing things in inefficient ways (using loops instead of vectorizing) or doing things that don't need to be done at all (generating many extra data frames that are thrown away). I wouldn't be surprised if 40 minutes could be reduced to 4 seconds.

I wrote a little test to compare DataFrame.rolling() against your method for computing the rolling average. I ran it for a dataframe with 4 columns and 1000 rows.

import pandas as pd
import time
# Make some dummy data
df = pd.DataFrame({i: [j * i for j in range(1000)] for i in range(1, 5)})
start = time.time()
df2 = df.rolling(10).mean()[9:]
print("Vectorize", time.time() - start)
start = time.time()
df3 = pd.DataFrame(columns=["Index", 1, 2, 3, 4])
df3["Index"] = range(len(df) - 9)
for row in range(len(df3)):
    df3[1][row] = df[1][row : row + 10].mean()
    df3[2][row] = df[2][row : row + 10].mean()
    df3[3][row] = df[3][row : row + 10].mean()
    df3[4][row] = df[4][row : row + 10].mean()
print("Loop", time.time() - start)

Output:
Vectorize 0.001995086669921875
Loop 8.190867900848389

Using rolling is 4105 times faster! I ran the vectorize version with 200230 rows which took 0.02789 seconds. I estimate the loop version would take 28 minutes. I want to change my runtime guess from 4 seconds down to 1.

When running the loop version I got this message showing up once per column.

Error:
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df3[1][row] = df[1][row : row + 10].mean()

Is this message displayed when you run your code?

Pandas does not like this: df1[column][row index] = df[column][row index]. It is called chained assignment, and depending on what you are doing, it might not work. It also slows things down. You can read about the concerns here:

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-view-versus-copy
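
For illustration, here is a small sketch of the same kind of assignment written with a single .loc call instead of chained indexing. The frames src and dst and the column "a" are invented for this example and are not from the code above.

```python
import pandas as pd

# Hypothetical source and destination frames
src = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]})
dst = pd.DataFrame({"a": [0.0, 0.0, 0.0, 0.0]})

# Chained form that pandas warns about:
#     dst["a"][0] = src["a"][2]
# Single .loc call: row label and column label are given together,
# so the write goes straight into dst
dst.loc[0, "a"] = src.loc[2, "a"]

# A block of rows can be assigned in one step, with no Python loop;
# the right-hand Series is aligned to dst on the index labels
dst.loc[1:, "a"] = src.loc[1:, "a"] * 10

print(dst)
```

Assigning whole slices or columns at once is usually both the safer and the faster option when the operation can be expressed that way.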
