How to Read a File Line-by-Line in Python

Introduction

Over the course of my working life I have had the opportunity to use many programming concepts and technologies to do countless things. Some of this work has been relatively low-value, such as automating error-prone or mundane tasks like report generation and general data reformatting. Other work has been much more valuable, such as developing data products, web applications, and data analysis and processing pipelines. One thing nearly all of these projects have in common is the need to simply open a file, parse its contents, and do something with it.

However, what do you do when the file you are trying to consume is quite large? What if the file is several GB of data or larger? Again, this has been another frequent aspect of my programming career, which has primarily been spent in the BioTech sector, where it’s common to encounter files of 1 TB+ in size.

The answer to this problem is to read in chunks of a file at a time, process it, then free it from memory so you can pull in and process another chunk until the whole massive file has been processed. While it is up to the programmer to determine a suitable chunk size, for many applications it is suitable to process a file one line at a time.
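As a rough sketch of the chunking idea, the generator below reads a fixed number of characters at a time from an open file object. The name read_in_chunks and the chunk sizes are illustrative assumptions rather than part of the scripts in this article, and an in-memory io.StringIO stands in for a large file.

```python
import io


def read_in_chunks(fp, chunk_size=1024):
    """Yield successive chunks of at most chunk_size characters from fp."""
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:  # an empty string signals end of file
            break
        yield chunk


# io.StringIO stands in for a large text file opened with open().
fake_file = io.StringIO("abcdefghij")
chunks = list(read_in_chunks(fake_file, chunk_size=4))
print(chunks)  # ['abcd', 'efgh', 'ij']
```

Because the function is a generator, only one chunk is held in memory at a time while the caller loops over it.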

Basic File IO in Python

Being a great general-purpose programming language, Python has a number of very useful file IO features in its standard library of built-in functions and modules. The built-in open() function is what you use to open a file object for reading or writing.

fp = open('path/to/file.txt', 'r')

The open() function takes in multiple arguments. We will be focusing on the first two, with the first being a positional string parameter representing the path to the file that should be opened. The second optional parameter is also a string, which specifies the mode of interaction you intend for the file object being returned by the function call. The most common modes are listed in the table below, with the default being ‘r’ for reading.

Mode   Description
r      Open for reading plain text
w      Open for writing plain text (truncates an existing file)
a      Open for appending plain text (creates the file if it does not exist)
rb     Open for reading binary data
wb     Open for writing binary data
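To make the modes concrete, here is a small sketch that writes, appends to, and reads back a scratch file. The file name modes_demo.txt and the temporary directory are hypothetical stand-ins chosen for this demonstration.

```python
import os
import tempfile

# A throwaway path; the file name is a hypothetical stand-in for this demo.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

fp = open(path, "w")   # 'w' creates the file (or truncates an existing one)
fp.write("first line\n")
fp.close()

fp = open(path, "a")   # 'a' adds to the end without erasing existing data
fp.write("second line\n")
fp.close()

fp = open(path, "r")   # 'r' returns the contents as str
text = fp.read()
fp.close()

fp = open(path, "rb")  # 'rb' returns raw bytes instead of str
raw = fp.read()
fp.close()

print(repr(text))
print(type(raw))
```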

Once you have written or read all of the desired data for a file object you need to close the file so that resources can be reallocated on the operating system that the code is running on.

fp.close()

You will often see code snippets on the internet, or programs in the wild, that do not explicitly close the file objects they open. It is always good practice to close a file object, but many of us are either too lazy or too forgetful to do so, or think we are being smart because the documentation suggests that an open file object will close itself once the process terminates. This is not always the case.

Instead of harping on how important it is to always call close() on a file object, I would like to provide an alternate and more elegant way to open a file object and ensure that the Python interpreter cleans up after us 🙂

with open('path/to/file.txt') as fp:
    # do stuff with fp, for example:
    contents = fp.read()

By simply using the with keyword (introduced in Python 2.5) to wrap our code for opening a file object, the internals of Python will do something similar to the following code to ensure that, no matter what, the file object is closed after use.

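# open the file before the try block, so fp is always defined in finally
fp = open('path/to/file.txt')
try:
    # do stuff with fp, for example:
    contents = fp.read()
finally:
    fp.close()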

Either of these two methods is suitable, with the first example being the more “Pythonic” way.
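One way to convince yourself that the with statement cleans up is to raise an error inside the block and then check the file object's closed attribute. The scratch file name below is a hypothetical stand-in, and the ValueError is raised on purpose purely for illustration.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not depend on a real path.
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")
with open(path, "w") as fp:
    fp.write("some text\n")

try:
    with open(path) as fp:
        raise ValueError("simulated failure while processing the file")
except ValueError:
    pass

# The with block closed the file even though an exception was raised inside it.
print(fp.closed)  # True
```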

Reading Line by Line

Now, let's get to actually reading in a file. The file object returned from open() has three common explicit methods (read, readline, and readlines) to read in data, and one more implicit way.

The read method reads all of the data into one text string. This is useful for smaller files where you would like to do text manipulation on the entire file. Then there is readline, which reads a single line at a time and returns it as a string. The last explicit method, readlines, reads all the lines of a file and returns them as a list of strings.

As mentioned earlier, you can use these methods to load only small chunks of the file at a time. Each of them accepts a single optional size argument. For read and readline it caps how many characters (or bytes, for a file opened in binary mode) a single call returns. For readlines it is a hint: no further lines are read once the total size of the lines read so far exceeds it.
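The three methods can be compared side by side with a short sketch. Here io.StringIO stands in for a text file returned by open(), and the sample text is made up for the demonstration.

```python
import io

text = "alpha\nbeta\ngamma\n"

# io.StringIO stands in for a text file returned by open().
fp = io.StringIO(text)
whole = fp.read()            # the entire contents as one string

fp.seek(0)                   # rewind to the start
first = fp.readline()        # one line, including its trailing newline
second = fp.readline()       # the next call returns the next line

fp.seek(0)
all_lines = fp.readlines()   # every line, as a list of strings

fp.seek(0)
partial = fp.read(3)         # at most 3 characters

print(repr(whole))
print(repr(first), repr(second))
print(all_lines)
print(repr(partial))
```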

One implementation for reading a text file one line at a time is shown below, which is done via the readline() method.

In readline.py you will find the following code. In the terminal if you run $ python readline.py you can see the output of reading all the lines of the Iliad, as well as their line numbers.

filepath = 'Iliad.txt'
with open(filepath) as fp:
    line = fp.readline()
    cnt = 1
    # readline() returns an empty string at end of file, which ends the loop
    while line:
        print("Line {}: {}".format(cnt, line.strip()))
        line = fp.readline()
        cnt += 1

The above code snippet opens a file object stored as a variable called fp, then reads in a line at a time by calling readline on that file object iteratively in a while loop and prints it to the console.

Running this code you should see something like the following:

$ python readline.py
Line 1: BOOK I
Line 2: 
Line 3: The quarrel between Agamemnon and Achilles--Achilles withdraws
Line 4: from the war, and sends his mother Thetis to ask Jove to help
Line 5: the Trojans--Scene between Jove and Juno on Olympus.
Line 6: 
Line 7: Sing, O goddess, the anger of Achilles son of Peleus, that brought
Line 8: countless ills upon the Achaeans. Many a brave soul did it send
Line 9: hurrying down to Hades, and many a hero did it yield a prey to dogs and
Line 10: vultures, for so were the counsels of Jove fulfilled from the day on
...

While this is perfectly fine, there is one final way, mentioned fleetingly earlier, that is less explicit but a bit more elegant, and which I greatly prefer. It consists of iterating over the file object in a for loop, assigning each line to a loop variable called line. The above code snippet can be replicated in the following code, which can be found in the Python script forlinein.py:

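filepath = 'Iliad.txt'
with open(filepath) as fp:
    # start=1 makes the numbering begin at 1, as in the readline() version
    for cnt, line in enumerate(fp, start=1):
        # strip() drops the trailing newline so print() does not double it
        print("Line {}: {}".format(cnt, line.strip()))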

In this implementation we are taking advantage of the fact that a file object is iterable: a for loop over fp implicitly yields one line at a time. Not only is this simpler to read, but it also takes fewer lines of code to write, which is a practice worth following.
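What the for loop does behind the scenes can be sketched with the built-in iter() and next() functions. As before, io.StringIO stands in for a real file object, and the sample lines are invented for the demonstration.

```python
import io

fp = io.StringIO("first\nsecond\nthird\n")

# A file object is its own iterator, so iter() hands back the same object.
same_object = iter(fp) is fp

# next() pulls one line at a time; a for loop does this behind the scenes.
first = next(fp)

# The loop resumes where next() left off, so only two lines remain.
remaining = [line.strip() for line in fp]

print(same_object, repr(first), remaining)
```

Since lines are produced on demand, the loop never needs the whole file in memory at once.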

An Example Application

I would be remiss to write an article on how to consume information in a text file without demonstrating at least a trivial use of such a worthy skill. That being said, I will demonstrate a small application, found in wordcount.py, which calculates the frequency of each word present in “The Iliad of Homer” used in the previous examples. This creates a simple bag of words, which is commonly used in NLP applications.


import os
import sys


def main():
    # Expect the path to a text file as the only command line argument
    if len(sys.argv) < 2:
        print("Usage: python wordcount.py <filepath>")
        sys.exit(1)
    filepath = sys.argv[1]

    if not os.path.isfile(filepath):
        print("File path {} does not exist. Exiting...".format(filepath))
        sys.exit(1)

    bag_of_words = {}
    with open(filepath) as fp:
        for cnt, line in enumerate(fp):
            print("line {} contents {}".format(cnt, line))
            record_word_cnt(line.strip().split(' '), bag_of_words)
    sorted_words = order_bag_of_words(bag_of_words, desc=True)
    print("Most frequent 10 words {}".format(sorted_words[:10]))


def order_bag_of_words(bag_of_words, desc=False):
    # Return (word, count) tuples sorted by count
    words = [(word, cnt) for word, cnt in bag_of_words.items()]
    return sorted(words, key=lambda x: x[1], reverse=desc)


def record_word_cnt(words, bag_of_words):
    # Tally each non-empty word, ignoring case
    for word in words:
        if word != '':
            if word.lower() in bag_of_words:
                bag_of_words[word.lower()] += 1
            else:
                bag_of_words[word.lower()] = 1


if __name__ == '__main__':
    main()

The above code represents a command line Python script that expects a file path passed in as an argument. The script uses the os module to make sure that the passed-in file path is a file that exists on the disk. If the path exists, then each line of the file is read and split on the spaces between words, and the resulting list of strings is passed to a function called record_word_cnt along with a dictionary called bag_of_words. The record_word_cnt function counts each instance of every word and records it in the bag_of_words dictionary.

Once all the lines of the file have been read and recorded in the bag_of_words dictionary, a final call is made to order_bag_of_words, which returns a list of tuples in (word, word count) format, sorted by word count. The returned list of tuples is used to print the 10 most frequently occurring words.
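As an alternative sketch, and not the approach wordcount.py takes, the standard library's collections.Counter can do the tallying and sorting. The function name count_words and the sample text are assumptions made for illustration. Note that this version splits on any whitespace rather than on single spaces.

```python
import io
from collections import Counter


def count_words(fp):
    """Return a Counter of lowercased words, reading fp one line at a time."""
    counts = Counter()
    for line in fp:
        # split() with no argument splits on any run of whitespace
        counts.update(word.lower() for word in line.split())
    return counts


# io.StringIO stands in for an open text file such as the Iliad.
sample = io.StringIO("Sing O goddess\nthe anger of Achilles\nsing the anger\n")
counts = count_words(sample)
print(counts.most_common(3))
```

Counter.most_common(n) returns (word, count) tuples ordered from most to least frequent, much like the list produced by order_bag_of_words with desc=True.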

Conclusion

So, in this article we have explored two ways to read a text file line-by-line, including one that I feel is a bit more Pythonic (the second way, demonstrated in forlinein.py). To wrap things up I presented a trivial application that is potentially useful for reading in and preprocessing data that could be used for text analytics or sentiment analysis.

As always I look forward to your comments and I hope you can use what has been discussed to develop exciting and useful applications.

 

