Fast extraction of a time range from syslog logfile?

Question

I've got a logfile in the standard syslog format. It looks like this, except with hundreds of lines per second:

Jan 11 07:48:46 blahblahblah... Jan 11 07:49:00 blahblahblah... Jan 11 07:50:13 blahblahblah... Jan 11 07:51:22 blahblahblah... Jan 11 07:58:04 blahblahblah...

It doesn't roll at exactly midnight, but it'll never have more than two days in it.

I often have to extract a timeslice from this file. I'd like to write a general-purpose script for this, that I can call like:

$ timegrep 22:30-02:00 /logs/something.log

...and have it pull out the lines from 22:30, onward across the midnight boundary, until 2am the next day.

There are a few caveats:

I don't want to have to bother typing the date(s) on the command line, just the times. The program should be smart enough to figure them out.
The log date format doesn't include the year, so it should guess based on the current year, but nonetheless do the right thing around New Year's Day.
I want it to be fast -- it should use the fact that the lines are in order to seek around in the file and use a binary search.

Before I spend a bunch of time writing this, does it already exist?

Dennis Williamson · Accepted Answer · 2010-06-16 04:12:58Z

Update: I've replaced the original code with an updated version with numerous improvements. Let's call this (actual?)alpha-quality.

This version includes:

command-line option handling
command-line date format validation
some try blocks
line reading moved into a function

Original text:

Well what do you know? "Seek" and ye shall find! Here is a Python program that seeks around in the file and uses a more-or-less binary search. It's considerably faster than that AWK script that other guy wrote.

It's (pre?)alpha-quality. It should have try blocks and input validation and lots of testing and could no doubt be more Pythonic. But here it is for your amusement. Oh, and it's written for Python 2.6.

New code:

#!/usr/bin/env python # -*- coding: utf-8 -*- # timegrep.py by Dennis Williamson 20100113 # in response to http://serverfault.com/questions/101744/fast-extraction-of-a-time-range-from-syslog-logfile # thanks to serverfault user http://serverfault.com/users/1545/mike # for the inspiration # Perform a binary search through a log file to find a range of times # and print the corresponding lines # tested with Python 2.6 # TODO: Make sure that it works if the seek falls in the middle of # the first or last line # TODO: Make sure it's not blind to a line where the sync read falls # exactly at the beginning of the line being searched for and # then gets skipped by the second read # TODO: accept arbitrary date # done: add -l long and -s short options # done: test time format version = "0.01a" import os, sys from stat import * from datetime import date, datetime import re from optparse import OptionParser # Function to read lines from file and extract the date and time def getdata(): """Read a line from a file Return a tuple containing: the date/time in a format such as 'Jan 15 20:14:01' the line itself The last colon and seconds are optional and not handled specially """ try: line = handle.readline(bufsize) except: print("File I/O Error") exit(1) if line == '': print("EOF reached") exit(1) if line[-1] == '\n': line = line.rstrip('\n') else: if len(line) >= bufsize: print("Line length exceeds buffer size") else: print("Missing newline") exit(1) words = line.split(' ') if len(words) >= 3: linedate = words[0] + " " + words[1] + " " + words[2] else: linedate = '' return (linedate, line) # End function getdata() # Set up option handling parser = OptionParser(version = "%prog " + version) parser.usage = "\n\t%prog [options] start-time end-time filename\n\n\ \twhere times are in the form hh:mm[:ss]" parser.description = "Search a log file for a range of times occurring yesterday \ and/or today using the current time to intelligently select the start and end. \ A date may be specified instead. Seconds are optional in time arguments." parser.add_option("-d", "--date", action = "store", dest = "date", default = "", help = "NOT YET IMPLEMENTED. Use the supplied date instead of today.") parser.add_option("-l", "--long", action = "store_true", dest = "longout", default = False, help = "Span the longest possible time range.") parser.add_option("-s", "--short", action = "store_true", dest = "shortout", default = False, help = "Span the shortest possible time range.") parser.add_option("-D", "--debug", action = "store", dest = "debug", default = 0, type = "int", help = "Output debugging information.\t\t\t\t\tNone (default) = %default, Some = 1, More = 2") (options, args) = parser.parse_args() if not 0 <= options.debug <= 2: parser.error("debug level out of range") else: debug = options.debug # 1 = print some debug output, 2 = print a little more, 0 = none if options.longout and options.shortout: parser.error("options -l and -s are mutually exclusive") if options.date: parser.error("date option not yet implemented") if len(args) != 3: parser.error("invalid number of arguments") start = args[0] end = args[1] file = args[2] # test for times to be properly formatted, allow hh:mm or hh:mm:ss p = re.compile(r'(^[2][0-3]|[0-1][0-9]):[0-5][0-9](:[0-5][0-9])?$') if not p.match(start) or not p.match(end): print("Invalid time specification") exit(1) # Determine Time Range yesterday = date.fromordinal(date.today().toordinal()-1).strftime("%b %d") today = datetime.now().strftime("%b %d") now = datetime.now().strftime("%R") if start > now or start > end or options.longout or options.shortout: searchstart = yesterday else: searchstart = today if (end > start > now and not options.longout) or options.shortout: searchend = yesterday else: searchend = today searchstart = searchstart + " " + start searchend = searchend + " " + end try: handle = open(file,'r') except: print("File Open Error") exit(1) # Set some initial values bufsize = 4096 # handle long lines, but put a limit them rewind = 100 # arbitrary, the optimal value is highly dependent on the structure of the file limit = 75 # arbitrary, allow for a VERY large file, but stop it if it runs away count = 0 size = os.stat(file)[ST_SIZE] beginrange = 0 midrange = size / 2 oldmidrange = midrange endrange = size linedate = '' pos1 = pos2 = 0 if debug > 0: print("File: '{0}' Size: {1} Today: '{2}' Now: {3} Start: '{4}' End: '{5}'".format(file, size, today, now, searchstart, searchend)) # Seek using binary search while pos1 != endrange and oldmidrange != 0 and linedate != searchstart: handle.seek(midrange) linedate, line = getdata() # sync to line ending pos1 = handle.tell() if midrange > 0: # if not BOF, discard first read if debug > 1: print("...partial: (len: {0}) '{1}'".format((len(line)), line)) linedate, line = getdata() pos2 = handle.tell() count += 1 if debug > 0: print("#{0} Beg: {1} Mid: {2} End: {3} P1: {4} P2: {5} Timestamp: '{6}'".format(count, beginrange, midrange, endrange, pos1, pos2, linedate)) if searchstart > linedate: beginrange = midrange else: endrange = midrange oldmidrange = midrange midrange = (beginrange + endrange) / 2 if count > limit: print("ERROR: ITERATION LIMIT EXCEEDED") exit(1) if debug > 0: print("...stopping: '{0}'".format(line)) # Rewind a bit to make sure we didn't miss any seek = oldmidrange while linedate >= searchstart and seek > 0: if seek < rewind: seek = 0 else: seek = seek - rewind if debug > 0: print("...rewinding") handle.seek(seek) linedate, line = getdata() # sync to line ending if debug > 1: print("...junk: '{0}'".format(line)) linedate, line = getdata() if debug > 0: print("...comparing: '{0}'".format(linedate)) # Scan forward while linedate < searchstart: if debug > 0: print("...skipping: '{0}'".format(linedate)) linedate, line = getdata() if debug > 0: print("...found: '{0}'".format(line)) if debug > 0: print("Beg: {0} Mid: {1} End: {2} P1: {3} P2: {4} Timestamp: '{5}'".format(beginrange, midrange, endrange, pos1, pos2, linedate)) # Now that the preliminaries are out of the way, we just loop, # reading lines and printing them until they are # beyond the end of the range we want while linedate <= searchend: print line linedate, line = getdata() if debug > 0: print("Start: '{0}' End: '{1}'".format(searchstart, searchend)) handle.close()

@Dennis Williamson: I see a line containing if debug > 0: print("File: '{0}' Size: {1} Today: '{2}' Now: {3} Start: '{4}' End: '{5}'".format(file, size, today, now, searchstar$. Is the searchstar supposed to end with a $, or is that a typo? I get a syntax error on this line (Line 159) — Stefan Lasiewski
– Stefan Lasiewski, Commented Jun 15, 2010 at 19:55
@Stefan: Thanks. It was a typo which I've fixed. For quick reference, the $ should instead be t, searchend)) so it says ... searchstart, searchend)) — Dennis Williamson
– Dennis Williamson, Commented Jun 15, 2010 at 20:48

Dennis Williamson · Accepted Answer · 2010-01-12 03:56:55Z

This will print the range of entries between a start time and an end time based on how they relate to the current time ("now").

Usage:

timegrep [-l] start end filename

Example:

$ timegrep 18:47 03:22 /some/log/file

The -l (long) option causes the longest possible output. The start time will be interpreted as yesterday if the hours and minutes value of the start time is less than both the end time and now. The end time will be interpreted as today if both the start time and end time HH:MM values are greater than "now".

Assuming that "now" is "Jan 11 19:00", this is how various example start and end times will be interpreted (without -l except as noted):

 start end range begin range end 19:01 23:59 Jan 10 Jan 10 19:01 00:00 Jan 10 Jan 11 00:00 18:59 Jan 11 Jan 11 18:59 18:58 Jan 10 Jan 10 19:01 23:59 Jan 10 Jan 11 # -l 00:00 18:59 Jan 10 Jan 11 # -l 18:59 19:01 Jan 10 Jan 11 # -l

Almost all of the script is setup. The last two lines do all the work.

Warning: no argument validation or error checking is done. Edge cases have not been thoroughly tested. This was written using gawk other versions of AWK may squawk.

#!/usr/bin/awk -f BEGIN { arg=1 if ( ARGV[arg] == "-l" ) { long = 1 ARGV[arg++] = "" } start = ARGV[arg] ARGV[arg++] = "" end = ARGV[arg] ARGV[arg++] = "" yesterday = strftime("%b %d", mktime(strftime("%Y %m %d -24 00 00"))) today = strftime("%b %d") now = strftime("%R") if ( start > now || start > end || long ) startdate = yesterday else startdate = today if ( end > now && end > start && start > now && ! long ) enddate = yesterday else enddate = today fi startdate = startdate " " start enddate = enddate " " end } $1 " " $2 " " $3 > enddate {exit} $1 " " $2 " " $3 >= startdate {print}

I think AWK is very efficient at searching through files. I don't think anything else is necessarily going to be any faster at searching an unindexed text file.

It seems you overlooked my third bullet point. The logs are on the order of 30 GB -- if the first line of the file is 7:00 and the last line is 23:00, and I want the slice between 22:00 and 22:01, I don't want the script looking at every line between 7:00 and 22:00. I want it to estimate where it would be, seek to that point, and make a new estimate until it finds it. — mike
– mike, Commented Jan 12, 2010 at 17:14
I didn't overlook it. I expressed my opinion in the last paragraph. — Dennis Williamson
– Dennis Williamson, Commented Jan 12, 2010 at 18:39

Michael Graff · Accepted Answer · 2010-01-12 01:12:06Z

0

From a quick search on the net, there are things that extract based upon keywords (like FIRE or such :) but nothing that extracts a date range from the file.

It does not seem hard to do what you propose:

Search for the starting time.
Print out that line.
If the ending time < start time, and a line's date is > end and < start, then stop.
If the ending time is > start time, and a line's date is > end, stop.

Seems straight forward, and I could write it for you if you don't mind Ruby :)

answered Jan 12, 2010 at 1:12

Michael Graff

6,6881 gold badge27 silver badges36 bronze badges

I don't mind Ruby, but #1 is not straightforward if you want to do it efficiently in a large file -- you need to seek() to the halfway point, find the nearest line, see how it begins, and repeat with a new midpoint. It's too inefficient to look at every line.

mike
– mike

2010-01-12 01:18:48 +00:00
Commented Jan 12, 2010 at 1:18
You said large, but didn't specify an actual size. Just how large is large? Worse, if there are multiple days involved, it would be pretty easy to find the wrong one only using the time. After all, if you cross a day boundary, the day the script is running will always be different than the start time. Will the files fit in memory via mmap()?

Michael Graff
– Michael Graff

2010-01-12 01:45:30 +00:00
Commented Jan 12, 2010 at 1:45
About 30 GB, on a network-mounted disk.

mike
– mike

2010-01-12 17:15:16 +00:00
Commented Jan 12, 2010 at 17:15

Add a comment |

Fred · Accepted Answer · 2010-01-15 19:51:25Z

A C++ program applying a binary search - it would need some simple modifications (i.e. calling strptime) to work with text dates.

http://gitorious.org/bs_grep/

I had a previous version with support for text dates, however it was still too slow for the scale of our log files; profiling said that over 90% of the time was spent in strptime, so, we just modified the log format to include a numeric unix timestamp as well.

Jeffrey Devloo · Accepted Answer · 2016-08-17 09:57:48Z

Even though this answer is way too late, it might be beneficial to some.

I've converted the code from @Dennis Williamson into a Python class that can be used for other python stuff.

I've added support for multiple date supports.

import os from stat import * from datetime import date, datetime import re # @TODO Support for rotated log files - currently using the current year for 'Jan 01' dates. class LogFileTimeParser(object): """ Extracts parts of a log file based on a start and enddate Uses binary search logic to speed up searching Common usage: validate log files during testing Faster than awk parsing for big log files """ version = "0.01a" # Set some initial values BUF_SIZE = 4096 # self.handle long lines, but put a limit to them REWIND = 100 # arbitrary, the optimal value is highly dependent on the structure of the file LIMIT = 75 # arbitrary, allow for a VERY large file, but stop it if it runs away line_date = '' line = None opened_file = None @staticmethod def parse_date(text, validate=True): # Supports Aug 16 14:59:01 , 2016-08-16 09:23:09 Jun 1 2005 1:33:06PM (with or without seconds, miliseconds) for fmt in ('%Y-%m-%d %H:%M:%S %f', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%M', '%b %d %H:%M:%S %f', '%b %d %H:%M', '%b %d %H:%M:%S', '%b %d %Y %H:%M:%S %f', '%b %d %Y %H:%M', '%b %d %Y %H:%M:%S', '%b %d %Y %I:%M:%S%p', '%b %d %Y %I:%M%p', '%b %d %Y %I:%M:%S%p %f'): try: if fmt in ['%b %d %H:%M:%S %f', '%b %d %H:%M', '%b %d %H:%M:%S']: return datetime.strptime(text, fmt).replace(datetime.now().year) return datetime.strptime(text, fmt) except ValueError: pass if validate: raise ValueError("No valid date format found for '{0}'".format(text)) else: # Cannot use NoneType to compare datetimes. Using minimum instead return datetime.min # Function to read lines from file and extract the date and time def read_lines(self): """ Read a line from a file Return a tuple containing: the date/time in a format supported in parse_date om the line itself """ try: self.line = self.opened_file.readline(self.BUF_SIZE) except: raise IOError("File I/O Error") if self.line == '': raise EOFError("EOF reached") # Remove \n from read lines. if self.line[-1] == '\n': self.line = self.line.rstrip('\n') else: if len(self.line) >= self.BUF_SIZE: raise ValueError("Line length exceeds buffer size") else: raise ValueError("Missing newline") words = self.line.split(' ') # This results into Jan 1 01:01:01 000000 or 1970-01-01 01:01:01 000000 if len(words) >= 3: self.line_date = self.parse_date(words[0] + " " + words[1] + " " + words[2],False) else: self.line_date = self.parse_date('', False) return self.line_date, self.line def get_lines_between_timestamps(self, start, end, path_to_file, debug=False): # Set some initial values count = 0 size = os.stat(path_to_file)[ST_SIZE] begin_range = 0 mid_range = size / 2 old_mid_range = mid_range end_range = size pos1 = pos2 = 0 # If only hours are supplied # test for times to be properly formatted, allow hh:mm or hh:mm:ss p = re.compile(r'(^[2][0-3]|[0-1][0-9]):[0-5][0-9](:[0-5][0-9])?$') if p.match(start) or p.match(end): # Determine Time Range yesterday = date.fromordinal(date.today().toordinal() - 1).strftime("%Y-%m-%d") today = datetime.now().strftime("%Y-%m-%d") now = datetime.now().strftime("%R") if start > now or start > end: search_start = yesterday else: search_start = today if end > start > now: search_end = yesterday else: search_end = today search_start = self.parse_date(search_start + " " + start) search_end = self.parse_date(search_end + " " + end) else: # Set dates search_start = self.parse_date(start) search_end = self.parse_date(end) try: self.opened_file = open(path_to_file, 'r') except: raise IOError("File Open Error") if debug: print("File: '{0}' Size: {1} Start: '{2}' End: '{3}'" .format(path_to_file, size, search_start, search_end)) # Seek using binary search -- ONLY WORKS ON FILES WHO ARE SORTED BY DATES (should be true for log files) try: while pos1 != end_range and old_mid_range != 0 and self.line_date != search_start: self.opened_file.seek(mid_range) # sync to self.line ending self.line_date, self.line = self.read_lines() pos1 = self.opened_file.tell() # if not beginning of file, discard first read if mid_range > 0: if debug: print("...partial: (len: {0}) '{1}'".format((len(self.line)), self.line)) self.line_date, self.line = self.read_lines() pos2 = self.opened_file.tell() count += 1 if debug: print("#{0} Beginning: {1} Mid: {2} End: {3} P1: {4} P2: {5} Timestamp: '{6}'". format(count, begin_range, mid_range, end_range, pos1, pos2, self.line_date)) if search_start > self.line_date: begin_range = mid_range else: end_range = mid_range old_mid_range = mid_range mid_range = (begin_range + end_range) / 2 if count > self.LIMIT: raise IndexError("ERROR: ITERATION LIMIT EXCEEDED") if debug: print("...stopping: '{0}'".format(self.line)) # Rewind a bit to make sure we didn't miss any seek = old_mid_range while self.line_date >= search_start and seek > 0: if seek < self.REWIND: seek = 0 else: seek -= self.REWIND if debug: print("...rewinding") self.opened_file.seek(seek) # sync to self.line ending self.line_date, self.line = self.read_lines() if debug: print("...junk: '{0}'".format(self.line)) self.line_date, self.line = self.read_lines() if debug: print("...comparing: '{0}'".format(self.line_date)) # Scan forward while self.line_date < search_start: if debug: print("...skipping: '{0}'".format(self.line_date)) self.line_date, self.line = self.read_lines() if debug: print("...found: '{0}'".format(self.line)) if debug: print("Beginning: {0} Mid: {1} End: {2} P1: {3} P2: {4} Timestamp: '{5}'". format(begin_range, mid_range, end_range, pos1, pos2, self.line_date)) # Now that the preliminaries are out of the way, we just loop, # reading lines and printing them until they are beyond the end of the range we want while self.line_date <= search_end: # Exclude our 'Nonetype' values if not self.line_date == datetime.min: print self.line self.line_date, self.line = self.read_lines() if debug: print("Start: '{0}' End: '{1}'".format(search_start, search_end)) self.opened_file.close() # Do not display EOFErrors: except EOFError as e: pass

Stack Exchange Network

Fast extraction of a time range from syslog logfile?

5 Answers 5

You must log in to answer this question.

Linked

Hot Network Questions

Fast extraction of a time range from syslog logfile?

5 Answers 5

You must log in to answer this question.

Linked

Related

Hot Network Questions