Posts: 56 Threads: 23 Joined: Jul 2021
Aug-19-2021, 01:03 PM (This post was last modified: Jun-24-2022, 10:15 AM by AlphaInc.)

Hello everybody,

With a batch script I export multiple text files to a specific folder, where I want to merge them into one. For that I used the type command in Windows CMD, which worked fine. But after comparing the files I noticed that the order is wrong. I found out that CMD sorts them differently (file_1, file_10, file_11, file_2, file_3, file_33) from how I would (file_1, file_2, file_3, file_10, file_11, file_33).

Now I would like to use a Python script to merge them in the order I would sort them. This is what I have so far, but I don't know how to go on:

    #!/usr/bin/env python3
    import os
    import re

    folder_path = "../Outputs/"

    for root, dirs, files in os.walk(folder_path, topdown=False):
        for name in files:
            if name.endswith(".txt"):
                file_name = os.path.join(root, name)

Edit: All my files start with the string "file_" followed by a running number. The number of files differs from export to export, so I can't set it up manually.

Posts: 4,874 Threads: 78 Joined: Jan 2018

Why are you calling os.walk() if you want to merge the files only in one folder?
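The ordering problem described in the first post can be reproduced without touching the file system. The following is only a small sketch, using the file names from the post; the helper name `numeric_suffix` is made up for illustration and assumes every name contains an underscore followed by a number.

```python
# File names as listed in the first post.
names = ["file_1", "file_10", "file_11", "file_2", "file_3", "file_33"]


def numeric_suffix(name):
    """Return the integer after the first underscore, e.g. 'file_10' -> 10.

    Hypothetical helper: it assumes the name has the form <prefix>_<number>.
    """
    return int(name.split("_", maxsplit=1)[1])


# Plain string comparison works character by character,
# so "file_10" sorts before "file_2".
lexicographic = sorted(names)

# Sorting on the integer value gives the order the poster expects.
numeric = sorted(names, key=numeric_suffix)

print(lexicographic)
print(numeric)
```

A plain string sort compares "1" with "2" before it ever looks at the second digit, which is why file_10 and file_11 land ahead of file_2.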
Posts: 2,171 Threads: 12 Joined: May 2017

    #!/usr/bin/env python3
    from pathlib import Path


    def sort_by_int(path):
        # Path has the stem attribute, which is
        # the filename without the last extension.
        # To sort the paths by integer, you
        # need to get the integer part of the str
        # and convert it to an integer.
        # The _ is the character where you can split;
        # maxsplit=1 splits only once,
        # so you get two elements back.
        # If the _ is missing, split returns a single element
        # and the [1] index raises an IndexError.
        return int(path.stem.split("_", maxsplit=1)[1])


    # Use the high-level Path object
    outputs = Path.home() / "Outputs"
    # print(outputs)
    # Path: /home/username/Outputs

    # use glob for easier search;
    # rglob searches recursively.
    # glob and rglob replicate the shell syntax:
    # the wildcard is one * and a ? stands for one character
    search = "file_*.txt"

    # sorted takes a key argument, which defines how it's sorted.
    # sort_by_int just returns an int and the sorted function
    # uses this number to sort
    sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
    # the result is a list:
    # sorted consumes the iterable object and returns a list
    # with the sorted elements

    # now use this sorted list of Path objects
    for path in sorted_outputs:
        # glob does not differentiate between files, directories
        # or other elements,
        # so you need to check if path is a file
        if path.is_file():
            # if it's a file, then open it.
            # The Path object has the method open;
            # like the open function, it supports a context manager
            with path.open() as fd:
                # iterate over this file line by line,
                # where the line end is not stripped away
                for line in fd:
                    # print the line, but tell print not to add an additional
                    # line end, because the line already has a line end
                    print(line, end="")

    # you can use stdout to redirect the output
    # in your shell to a file, for example, or to netcat
    # or gzip etc.

Using the program (I named it randomly searchp.py):

Output:
    [andre@andre-Fujitsu-i5 ~]$ python3 searchp.py | gzip > output.txt.gz
    [andre@andre-Fujitsu-i5 ~]$ zcat output.txt.gz
    666
    123
    000
I created the Outputs directory in my home directory and put 3 files in, where each file had only one line with a newline character at the end.

Posts: 6,920 Threads: 22 Joined: Feb 2020

If you have control of the file names, the easiest solution is to generate file names that sort properly: file_000, file_001, file_010, file_100.

Posts: 56 Threads: 23 Joined: Jul 2021

(Aug-19-2021, 02:08 PM)DeaD_EyE Wrote: I created the Outputs directory in my home directory and put 3 files in, where each file had only one line with a newline character at the end.

Thanks for your reply. When I start the script on my Windows machine there is no output. Do I need to tweak your script?

Posts: 7,398 Threads: 123 Joined: Sep 2016
Aug-19-2021, 10:12 PM (This post was last modified: Aug-19-2021, 10:12 PM by snippsat.)

(Aug-19-2021, 05:02 PM)AlphaInc Wrote: When I start the script on my Windows machine there is no output. Do I need to tweak your script?
On Windows the home path will be C:\Users\<username>\Outputs. You can give the path to where you have the .txt files if you do not want to create this Outputs folder, e.g.

    outputs = Path(r'G:\div_code\answer')

I tested it and added code to save the lines, because the redirect shown by DeaD_EyE may not work for you: gzip and zcat, for example, are not part of Windows.

    from pathlib import Path


    def sort_by_int(path):
        # Path has the stem attribute, which is
        # the filename without the last extension.
        # To sort the paths by integer, you
        # need to get the integer part of the str
        # and convert it to an integer.
        # The _ is the character where you can split;
        # maxsplit=1 splits only once,
        # so you get two elements back.
        # If the _ is missing, split returns a single element
        # and the [1] index raises an IndexError.
        return int(path.stem.split("_", maxsplit=1)[1])


    # Use the high-level Path object
    #outputs = Path.home() / "Outputs"
    outputs = Path(r'G:\div_code\answer')
    #print(outputs)

    # use glob for easier search;
    # rglob searches recursively.
    # glob and rglob replicate the shell syntax:
    # the wildcard is one * and a ? stands for one character
    search = "file_*.txt"

    # sorted takes a key argument, which defines how it's sorted.
    # sort_by_int just returns an int and the sorted function
    # uses this number to sort
    sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
    #print(sorted_outputs)
    # the result is a list:
    # sorted consumes the iterable object and returns a list
    # with the sorted elements

    # now use this sorted list of Path objects
    lines = []
    for path in sorted_outputs:
        # glob does not differentiate between files, directories
        # or other elements,
        # so you need to check if path is a file
        if path.is_file():
            print(path)
            # if it's a file, then open it.
            # The Path object has the method open;
            # like the open function, it supports a context manager
            with path.open() as fd:
                # iterate over this file line by line
                for line in fd:
                    # strip the line end (and surrounding whitespace);
                    # print adds its own line end
                    line = line.strip()
                    print(line)
                    lines.append(line)

    # instead of redirecting stdout in the shell,
    # write the collected lines to one file
    with open('lines.txt', 'w') as f:
        f.write('\n'.join(lines))

Output:
    G:\div_code\answer\file_1.txt
    line1
    G:\div_code\answer\file_3.txt
    line2
    G:\div_code\answer\file_10.txt
    line3
    G:\div_code\answer\file_33.txt
    line4
lines.txt:
Output:
    line1
    line2
    line3
    line4

Posts: 56 Threads: 23 Joined: Jul 2021
Aug-20-2021, 05:48 AM (This post was last modified: Aug-20-2021, 05:48 AM by AlphaInc.)

(Aug-19-2021, 10:12 PM)snippsat Wrote: On Windows the home path will be C:\Users\<username>\Outputs. You can give the path to where you have the .txt files if you do not want to create this Outputs folder.
Thanks also for your help. I can see that the script processes all my files, but it does not create an output file.
Edit: My bad, I was looking for the wrong file name. I found it. Thank you.

Posts: 56 Threads: 23 Joined: Jul 2021

(Aug-20-2021, 05:48 AM)AlphaInc Wrote:
Thanks also for your help. I can see that the script processes all my files, but it does not create an output file.
Edit: My bad, I was looking for the wrong file name. I found it. Thank you.

Sorry once again, but I get an error:

    Traceback (most recent call last):
      File "FileProcessing_11.py", line 30, in <module>
        sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
      File "FileProcessing_11.py", line 13, in sort_by_int
        return int(path.stem.split("_", maxsplit=1)[1])
    IndexError: list index out of range

Could it be because the file names have spaces in them?

Posts: 7,398 Threads: 123 Joined: Sep 2016
Aug-20-2021, 10:14 AM (This post was last modified: Aug-20-2021, 10:14 AM by snippsat.)

(Aug-20-2021, 08:12 AM)AlphaInc Wrote: Could it be because the file names have spaces in them?

A couple of tips on how to troubleshoot this.

    def sort_by_int(path):
        print(path)
        print(path.stem)
        # Path has the stem attribute, which is
        # ... rest of the function as before

By adding this you see what happens before the error.

Test:

    >>> f = Path(r'G:\div_code\answer\file_33.txt')
    >>> f.stem
    'file_33'
    >>> f.stem.split('_', maxsplit=1)
    ['file', '33']
    >>> f.stem.split('_', maxsplit=1)[1]
    '33'

Reproducing your error:

    >>> f = Path(r'G:\div_code\answer\file33.txt')
    >>> f.stem
    'file33'
    >>> f.stem.split('_', maxsplit=1)
    ['file33']
    >>> f.stem.split('_', maxsplit=1)[1]
    Traceback (most recent call last):
      File "<interactive input>", line 1, in <module>
    IndexError: list index out of range

In your first post all the files you show (file_1, file_2, file_3, file_10, etc.) had a _, so it should work for them. With those print() calls, or with repr() added (which shows everything, e.g. spaces), you will see every file name passed in before the error.

    print(repr(path))
    print(repr(path.stem))
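If some file names really do lack the underscore, one way around the IndexError is a sort key that does not depend on it. The following is only a sketch, not code from the thread: `numeric_key` and `merge_text_files` are hypothetical names, the regular expression assumes the number sits at the end of the stem, and the demo files are invented for illustration.

```python
import re
import tempfile
from pathlib import Path


def numeric_key(path):
    """Sort key that tolerates names without an underscore.

    Stems ending in digits sort by that number; all other names
    sort after them, alphabetically. (Hypothetical helper.)
    """
    match = re.search(r"(\d+)$", path.stem.strip())
    if match is None:
        return (1, 0, path.name)
    return (0, int(match.group(1)), path.name)


def merge_text_files(folder, output_file, pattern="*.txt"):
    """Concatenate matching files in numeric order; return the order used."""
    folder = Path(folder)
    output_file = Path(output_file)
    sources = sorted(
        (
            p for p in folder.glob(pattern)
            # skip directories and the output file itself
            if p.is_file() and p.resolve() != output_file.resolve()
        ),
        key=numeric_key,
    )
    with output_file.open("w") as out:
        for source in sources:
            text = source.read_text()
            out.write(text)
            # make sure the next file starts on a new line
            if text and not text.endswith("\n"):
                out.write("\n")
    return sources


# Demo in a temporary folder with deliberately inconsistent names.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for name in ["file_10.txt", "file_2.txt", "file33.txt", "file 1.txt"]:
        (tmp / name).write_text(name + "\n")
    merged = tmp / "merged.out"
    order = [p.name for p in merge_text_files(tmp, merged)]
    merged_text = merged.read_text()

print(order)
```

The key returns a tuple so that names without any number cannot raise an exception; they are simply placed last. Whether that is the right place for them depends on the export, so printing `order` before trusting the merged file is a reasonable check.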