Hello everybody,
With a batch script I export multiple text files to a specific folder, where I want to merge them into one.
For that I used the type command in the Windows cmd, which worked fine. But after comparing the files I noticed that the order is wrong. I found out that CMD sorts them as plain strings (file_1,file_10,file_11,file_2,file_3,file_33) instead of numerically, the way I would do it (file_1,file_2,file_3,file_10,file_11,file_33).
Now I would like to use a Python script to merge them in the order I would sort them (file_1,file_2,file_3,file_10,file_11,file_33).
This is what I have so far but I don't know how to go on:
#!/usr/bin/env python3
import os
import re

folder_path = "../Outputs/"

for root, dirs, files in os.walk(folder_path, topdown = False):
    for name in files:
        if name.endswith(".txt"):
            file_name = os.path.join(root, name)

Edit: All my files start with the string "file_" followed by a running number. The number of files differs from export to export, so I can't set it up manually.
Why are you calling os.walk() if you want to merge the files only in one folder?
#!/usr/bin/env python3
from pathlib import Path


def sort_by_int(path):
    # Path has the stem attribute, which is
    # the filename without the last extension
    # to sort the paths by integer, you
    # need to get the integer part of the str
    # and convert it to an integer
    # the _ is the character where you can split
    # maxsplit=1 does only split once,
    # so you get two elements back
    # if the _ is missing, split returns only one element
    # and the index [1] raises an IndexError
    return int(path.stem.split("_", maxsplit=1)[1])


# Use the high level Path object
outputs = Path.home() / "Outputs"
# print(outputs)
# Path: /home/username/Outputs

# use glob for easier search
# rglob is to search recursively
# glob and rglob replicate the shell syntax
# the wildcard is one * and a ? stands for one character
search = "file_*.txt"

# sorted takes a key argument, which is used to define how it's sorted
# sort_by_int just returns an int and the sorted function
# is using this number to sort
sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
# the result is a list
# sorted consumes the iterable object and returns a list
# with the sorted elements

# now using this sorted list with Path objects
for path in sorted_outputs:
    # glob does not differentiate between files, directories or other
    # elements
    # so you need to check if path is a file
    if path.is_file():
        # if it's a file, then open it
        # the Path object has the method open
        # like the open function, it supports a context manager
        with path.open() as fd:
            # iterating this file line by line
            # where the line end is not stripped away
            for line in fd:
                # print the line, but tell print not to add an additional
                # line end, because the line has already a line end
                print(line, end="")

# you can use the stdout to redirect the output
# in your shell to a file for example or netcat
# or gzip etc...
Using the program (I arbitrarily named it searchp.py):
Output:
[andre@andre-Fujitsu-i5 ~]$ python3 searchp.py | gzip > output.txt.gz
[andre@andre-Fujitsu-i5 ~]$ zcat output.txt.gz
666
123
000

I created the Outputs directory in my home directory and put 3 files in it, where each file had only one line with a newline character at the end.
If you have control of the file names, the easiest solution is to generate file names that sort properly as plain strings: file_000, file_001, file_010, file_100
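A minimal sketch of that idea, assuming the export can pad the number to a fixed width (three digits here; the names and the width are only illustrative). It also shows why the unpadded names end up in the order seen in the first post:

```python
# Hypothetical file names for illustration; the real export chooses its own.
numbers = (1, 2, 3, 10, 11, 33)

# Zero-padded to three digits: every number has the same width,
# so plain string sorting matches numeric order.
padded = [f"file_{n:03d}.txt" for n in numbers]
print(sorted(padded))

# Unpadded: strings are compared character by character,
# so "file_10.txt" sorts before "file_2.txt".
unpadded = [f"file_{n}.txt" for n in numbers]
print(sorted(unpadded))
```

The width has to be large enough for the highest number that can occur, otherwise the problem comes back once the files exceed it.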
(Aug-19-2021, 02:08 PM) DeaD_EyE wrote: [script and example output as posted above]
Thanks for your reply. When I start the script on my Windows machine there is no output. Do I need to tweak your script?
(Aug-19-2021, 05:02 PM) AlphaInc wrote: So when I start the script on my windows machine there is no output. Do I need to do some tweaking of your script ?
On Windows the home path will be C:\Users\<username>\Outputs
If you do not want to create this Outputs folder, you can give the path to the folder where your .txt files are.
Eg.
outputs = Path(r'G:\div_code\answer')
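If the script prints nothing, one likely cause is that the folder does not exist or that no file matches the pattern. The following is a small sketch for checking this; `report_matches` is a hypothetical helper name, not part of the script above:

```python
from pathlib import Path


def report_matches(folder, pattern="file_*.txt"):
    """Print a short diagnostic and return the paths that match pattern."""
    folder = Path(folder)
    if not folder.is_dir():
        print(f"Folder not found: {folder}")
        return []
    matches = list(folder.glob(pattern))
    print(f"{len(matches)} file(s) match {pattern!r} in {folder}")
    return matches


if __name__ == "__main__":
    # Default location used by the script above; adjust to your own folder.
    report_matches(Path.home() / "Outputs")
```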
I tested it and added code to save the lines to a file, because the redirect shown by DeaD_EyE may not work for you: gzip and zcat, for example, are not part of Windows.
from pathlib import Path


def sort_by_int(path):
    # Path has the stem attribute, which is
    # the filename without the last extension
    # to sort the paths by integer, you
    # need to get the integer part of the str
    # and convert it to an integer
    # the _ is the character where you can split
    # maxsplit=1 does only split once,
    # so you get two elements back
    # if the _ is missing, split returns only one element
    # and the index [1] raises an IndexError
    return int(path.stem.split("_", maxsplit=1)[1])


# Use the high level Path object
#outputs = Path.home() / "Outputs"
outputs = Path(r'G:\div_code\answer')
#print(outputs)
# Path: /home/username/Outputs

# use glob for easier search
# rglob is to search recursively
# glob and rglob replicate the shell syntax
# the wildcard is one * and a ? stands for one character
search = "file_*.txt"

# sorted takes a key argument, which is used to define how it's sorted
# sort_by_int just returns an int and the sorted function
# is using this number to sort
sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
#print(sorted_outputs)
# the result is a list
# sorted consumes the iterable object and returns a list
# with the sorted elements

# now using this sorted list with Path objects
lines = []
for path in sorted_outputs:
    # glob does not differentiate between files, directories or other
    # elements
    # so you need to check if path is a file
    if path.is_file():
        print(path)
        # if it's a file, then open it
        # the Path object has the method open
        # like the open function, it supports a context manager
        with path.open() as fd:
            # iterating this file line by line
            # where the line end is not stripped away
            for line in fd:
                # print the line; if the line still has its line end,
                # print adds a second one
                print(line)
                # store the line without surrounding whitespace
                lines.append(line.strip())

# save the collected lines to a file instead of redirecting stdout
with open('lines.txt', 'w') as f:
    f.write('\n'.join(lines))

Output:
G:\div_code\answer\file_1.txt
line1
G:\div_code\answer\file_3.txt
line2
G:\div_code\answer\file_10.txt
line3
G:\div_code\answer\file_33.txt
line4

lines.txt:
Output:
line1
line2
line3
line4
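As a variation, here is a sketch that writes each file straight into the merged file instead of collecting stripped lines in a list, so indentation and blank lines inside the files are kept. `merge_files` and the output name are assumptions for illustration; the output name should not match the search pattern, otherwise a second run would pick up the merged file as input.

```python
from pathlib import Path


def sort_by_int(path):
    # Same key as in the script above: the number after the first "_".
    return int(path.stem.split("_", maxsplit=1)[1])


def merge_files(folder, output_file, pattern="file_*.txt"):
    """Concatenate all matching files in numeric order into output_file."""
    folder = Path(folder)
    paths = sorted(
        (p for p in folder.glob(pattern) if p.is_file()),
        key=sort_by_int,
    )
    with Path(output_file).open("w") as out:
        for path in paths:
            text = path.read_text()
            out.write(text)
            # Make sure the next file starts on a new line.
            if text and not text.endswith("\n"):
                out.write("\n")
    return paths
```

It could be called as `merge_files(r'G:\div_code\answer', 'merged.txt')`, with both arguments replaced by your own paths.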
(Aug-19-2021, 10:12 PM) snippsat wrote: On Windows the home path will be C:\Users\<username>\Outputs [...]
Thanks for your help as well. I can see that the script processes all my files, but it does not create an output file.
Edit: My bad, I was just looking for the wrong name. I found it. Thank you.
(Aug-20-2021, 05:48 AM) AlphaInc wrote: Edit: My Bad, just looked after the wrong name. I found it. Thank you.
Sorry once again but I get an error:
Traceback (most recent call last):
  File "FileProcessing_11.py", line 30, in <module>
    sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
  File "FileProcessing_11.py", line 13, in sort_by_int
    return int(path.stem.split("_", maxsplit=1)[1])
IndexError: list index out of range

Could it be because the files got spaces in it?
(Aug-20-2021, 08:12 AM) AlphaInc wrote: Could it be because the files got spaces in it?
A couple of tips on how to troubleshoot this.
def sort_by_int(path):
    print(path)
    print(path.stem)
    # Path has the stem attribute, which is

By adding this you can see what happens right before the error.
Test.
>>> f = Path(r'G:\div_code\answer\file_33.txt')
>>> f.stem
'file_33'
>>> f.stem.split('_', maxsplit=1)
['file', '33']
>>> f.stem.split('_', maxsplit=1)[1]
'33'

Reproducing your error.
>>> f = Path(r'G:\div_code\answer\file33.txt')
>>> f.stem
'file33'
>>> f.stem.split('_', maxsplit=1)
['file33']
>>> f.stem.split('_', maxsplit=1)[1]
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
IndexError: list index out of range

In your first post all the files you showed (file_1, file_2, file_3, file_10, etc.) had a _, so with those names it should work.
With those print() calls, or with repr() added (which shows everything, e.g. spaces), you will see every file name that is passed in before the error.
print(repr(path))
print(repr(path.stem))
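Once the offending names are known, a more tolerant sort key can avoid the IndexError altogether. This is only a sketch, and it swaps the split on "_" for a regular expression that takes the last run of digits in the stem; `trailing_number` and `split_by_number` are hypothetical helper names. Names without any number are set aside instead of raising an exception:

```python
import re
from pathlib import Path


def trailing_number(path):
    """Return the last run of digits in the file stem as an int, or None."""
    found = re.findall(r"\d+", Path(path).stem)
    return int(found[-1]) if found else None


def split_by_number(paths):
    """Return (paths sorted by their number, paths without a number)."""
    numbered, skipped = [], []
    for p in paths:
        (skipped if trailing_number(p) is None else numbered).append(p)
    return sorted(numbered, key=trailing_number), skipped


# Illustrative names only: with "_", with a space, without separator, without number.
names = ["file_10.txt", "file 2.txt", "file33.txt", "file_.txt"]
ordered, skipped = split_by_number(names)
print(ordered)
print(skipped)
```

Printing the skipped list serves the same purpose as the print() calls above: it shows which names do not fit the expected scheme.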