How to List Files in a Directory using Python

How to List Files in a Directory using Python

Using os.walk()

The os module contains a long list of methods that deal with the filesystem, and the operating system. One of them is walk(), which generates the filenames in a directory tree by walking the tree either top-down or bottom-up (with top-down being the default setting).

os.walk() returns a list of three items. It contains the name of the root directory, a list of the names of the subdirectories, and a list of the filenames in the current directory. Listing 1 shows how to write this with only three lines of code. This works with both Python 2 and 3 interpreters.

Listing 1: Traversing the current directory using os.walk()

import os

for root, dirs, files in os.walk("."):
    for filename in files:
        print(filename)

Using the Command Line via Subprocess

Note: While this is a valid way to list files in a directory, it is not recommended as it introduces the opportunity for command injection attacks.

As already described in the article Parallel Processing in Python, the subprocess module allows you to execute a system command, and collect its result. The system command we call in this case is the following one:

Example 1: Listing the files in the current directory

$ ls -p . | grep -v /$

The command ls -p . lists directory files for the current directory, and adds the delimiter / at the end of the name of each subdirectory, which we’ll need in the next step. The output of this call is piped to the grep command that filters the data as we need it.

The parameters -v /$ exclude all the names of entries that end with the delimiter /. Actually, /$ is a Regular Expression that matches all the strings that contain the character / as the very last character before the end of the string, which is represented by $.

The subprocess module allows to build real pipes, and to connect the input and output streams as you do on a command line. Calling the method subprocess.Popen() opens a corresponding process, and defines the two parameters named stdin and stdout.

Listing 2 shows how to program that. The first variable ls is defined as a process executing ls -p . that outputs to a pipe. That’s why the stdout channel is defined as subprocess.PIPE. The second variable grep is defined as a process, too, but executes the command grep -v /$, instead.

To read the output of the ls command from the pipe, the stdin channel of grep is defined as ls.stdout. Finally, the variable endOfPipe reads the output of grep from grep.stdout that is printed to stdout element-wise in the for-loop below. The output is seen in Example 2.

Listing 2: Defining two processes connected with a pipe

import subprocess

/* define the ls command */
ls = subprocess.Popen(["ls", "-p", "."],
                      stdout=subprocess.PIPE,
                     )

/* define the grep command */
grep = subprocess.Popen(["grep", "-v", "/$"],
                        stdin=ls.stdout,
                        stdout=subprocess.PIPE,
                        )

/* read from the end of the pipe (stdout) */
endOfPipe = grep.stdout

/* output the files line by line */
for line in endOfPipe:
    print (line)

Example 2: Running the program

$ python find-files3.py
find-files2.py
find-files3.py
find-files4.py
...

This solution works quite well with both Python 2 and 3, but can we improve it somehow? Let us have a look at the other variants, then.

Combining os and fnmatch

As you have seen before the solution using subprocesses is elegant but requires lots of code. Instead, let us combine the methods from the two modules os, and fnmatch. This variant works with Python 2 and 3, too.

As the first step, we import the two modules os, and fnmatch. Next, we define the directory we would like to list the files using os.listdir(), as well as the pattern for which files to filter. In a for loop we iterate over the list of entries stored in the variable listOfFiles.

Finally, with the help of fnmatch we filter for the entries we are looking for, and print the matching entries to stdout. Listing 3 contains the Python script, and Example 3 the corresponding output.

Listing 3: Listing files using os and fnmatch module

import os, fnmatch

listOfFiles = os.listdir('.')
pattern = "*.py"
for entry in listOfFiles:
    if fnmatch.fnmatch(entry, pattern):
            print (entry)

Example 3: The output of Listing 3

$ python2 find-files.py
find-files.py
find-files2.py
find-files3.py
...

Using os.listdir() and Generators

In simple terms, a generator is a powerful iterator that keeps its state. To learn more about generators, check out one of our previous articles, Python Generators.

The following variant combines the listdir() method of the os module with a generator function. The code works with both versions 2 and 3 of Python.

As you may have noted before, the listdir() method returns the list of entries for the given directory. The method os.path.isfile() returns True if the given entry is a file. The yield operator quits the function but keeps the current state, and returns only the name of the entry detected as a file. This allows us to loop over the generator function (see Listing 4). The output is identical to the one from Example 3.

Listing 4: Combining os.listdir() and a generator function

import os

def files(path):
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            yield file

for file in files("."):
    print (file)

Use pathlib

The pathlib module describes itself as a way to “Parse, build, test, and otherwise work on filenames and paths using an object-oriented API instead of low-level string operations”. This sounds cool – let’s do it. Starting with Python 3, the module belongs to the standard distribution.

In Listing 5, we first define the directory. The dot (“.”) defines the current directory. Next, the iterdir() method returns an iterator that yields the names of all the files. In a for loop we print the name of the files one after the other.

Listing 5: Reading directory contents with pathlib

import pathlib

/* define the path */
currentDirectory = pathlib.Path('.')

for currentFile in currentDirectory.iterdir():
    print(currentFile)

Again, the output is identical to the one from Example 3.

As an alternative, we can retrieve files by matching their filenames by using something called a glob. This way we can only retrieve the files we want. For example, in the code below we only want to list the Python files in our directory, which we do by specifying “*.py” in the glob.

Listing 6: Using pathlib with the glob method

import pathlib

/* define the path */
currentDirectory = pathlib.Path('.')

/* define the pattern */
currentPattern = "*.py"

for currentFile in currentDirectory.glob(currentPattern):
    print(currentFile)

Using os.scandir()

In Python 3.6, a new method becomes available in the os module. It is named scandir(), and significantly simplifies the call to list files in a directory.

Having imported the os module first, use the getcwd() method to detect the current working directory, and save this value in the path variable. Next, scandir() returns a list of entries for this path, which we test for being a file using the is_file() method.

Listing 7: Reading directory contents with scandir()

import os

/* detect the current working directory */
path = os.getcwd()

/* read the entries */
with os.scandir(path) as listOfEntries:
    for entry in listOfEntries:
        /* print all entries that are files */
        if entry.is_file():
            print(entry.name)

Again, the output of Listing 7 is identical to the one from Example 3.

Conclusion

There is disagreement which version is the best, which is the most elegant, and which is the most “pythonic” one. I like the simplicity of the os.walk() method as well as the usage of both the fnmatch and pathlib modules.

The two versions with the processes/piping and the iterator require a deeper understanding of UNIX processes and Python knowledge, so they may not be best for all programmers due to their added (and unnecessary) complexity.

To find an answer to which version is the quickest one, the timeit module is quite handy. This module counts the time that has elapsed between two events.

To compare all of our solutions without modifying them, we use a Python functionality: call the Python interpreter with the name of the module, and the appropriate Python code to be executed. To do that for all the Python scripts at once a shell script helps (Listing 8).

Listing 8: Evaluating the execution time using the timeit module

#! /bin/bash

for filename in *.py; do
    echo "$filename:"
    cat $filename | python3 -m timeit
    echo " "
done

The tests were taken using Python 3.5.3. The result is as follows whereas os.walk() gives the best result. Running the tests with Python 2 returns different values but does not change the order – os.walk() is still on top of the list.

 

Python Example for Beginners

Two Machine Learning Fields

There are two sides to machine learning:

  • Practical Machine Learning:This is about querying databases, cleaning data, writing scripts to transform data and gluing algorithm and libraries together and writing custom code to squeeze reliable answers from data to satisfy difficult and ill defined questions. It’s the mess of reality.
  • Theoretical Machine Learning: This is about math and abstraction and idealized scenarios and limits and beauty and informing what is possible. It is a whole lot neater and cleaner and removed from the mess of reality.

Data Science Resources: Data Science Recipes and Applied Machine Learning Recipes

Introduction to Applied Machine Learning & Data Science for Beginners, Business Analysts, Students, Researchers and Freelancers with Python & R Codes @ Western Australian Center for Applied Machine Learning & Data Science (WACAMLDS) !!!

Latest end-to-end Learn by Coding Recipes in Project-Based Learning:

Applied Statistics with R for Beginners and Business Professionals

Data Science and Machine Learning Projects in Python: Tabular Data Analytics

Data Science and Machine Learning Projects in R: Tabular Data Analytics

Python Machine Learning & Data Science Recipes: Learn by Coding

R Machine Learning & Data Science Recipes: Learn by Coding

Comparing Different Machine Learning Algorithms in Python for Classification (FREE)

Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.  

Google –> SETScholars