Julius Seporaitis
on hobbies and work

Large Scale Refactoring With PyBowler

I have been intrigued by sophisticated refactoring tools for a while, but never found an opportunity that would warrant the time investment in learning one, since the changes were always small. Lately, I had to remove some old code. While some IDEs (like PyCharm) have basic refactoring tools, I am a creature of habit and stuck with Spacemacs. For that reason, my primary refactoring Swiss Army knife has for a long time been Facebook’s codemod [1] and a regular expression of some proportion. In this particular case, though, that approach seemed extremely difficult.

Cue Pybowler [2] - a Python tool for safe code refactoring (announcement talk). It seriously lacks good examples, but after a few failed attempts, I found it very helpful. This post provides example code for the situations I encountered.

If you want to follow along, clone the pybowler-example-repo GitHub repository. The example code is elementary:

# src/classes.py

class FooClass:
    def __init__(self, value):
        self.value = value

    def run(self):
        logger.info(f"FooClass.run<value={self.value}>")
        return self.value + 1

# src/functions.py

def run(val1, val2):
    logger.info("run")

    foo = FooClass(value=val2)

    return val1 + foo.run()

There is a class FooClass with a method run, a separate function named run, plus some tests for both. I hope that having so little code makes it easier to see how things work. If you have the example repository set up, let’s dive straight in.

Changing Arguments

One of the simplest things that Pybowler enables out of the box is adding/removing arguments.

Every refactoring with Pybowler starts by building a Query, followed by fluent-interface calls to selector(s) (e.g., select_function) and filter(s) that find and narrow down the parts of the code to be changed by modifier(s) (e.g., add_argument), and ends with a final execute(). There are a few more pieces to this that I am leaving out of this post, but you can read more in the documentation.

# examples/01-add-arguments.py
from bowler import Query


def main():
    (
        Query()
        .select_function("run")
        .add_argument(
            "auto_param", '"default_value"', positional=True,
        )
        .execute()
    )

Run bowler run examples/01-add-arguments.py src tests to see it in action:

--- ./src/classes.py
+++ ./src/classes.py
@@ -7,6 +7,6 @@
     def __init__(self, value):
         self.value = value

-    def run(self):
+    def run(self, auto_param):
         logger.info(f"FooClass.run<value={self.value}>")
         return self.value + 1
--- ./src/functions.py
+++ ./src/functions.py
@@ -5,7 +5,7 @@
 logger = logging.getLogger(__name__)


-def run(val1, val2):
+def run(val1, val2, auto_param):
     logger.info("run")

     foo = FooClass(value=val2)
--- ./tests/test_functions.py
+++ ./tests/test_functions.py
@@ -4,7 +4,7 @@


 def test_run(caplog):
-    assert run(1, 1) == 3
+    assert run(1, 1, "default_value") == 3

     assert caplog.record_tuples == [
         ("src.functions", logging.INFO, "run"),

Notice that it alters both the function and the method named run, even though select_function was used. I am not 100% certain whether this discrepancy is expected behaviour or a bug, but either way, it is something to keep an eye out for.
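Because a name-based selector can touch more than you expect, it may help to list every definition with that name before running a query. The snippet below is a minimal sketch using only the standard library’s ast module, not Bowler’s API; find_definitions and the inline SOURCE are hypothetical names invented for this illustration.

```python
import ast

# Simplified stand-in for src/classes.py and src/functions.py (assumption).
SOURCE = '''
class FooClass:
    def run(self):
        return 1

def run(val1, val2):
    return val1 + val2
'''


def find_definitions(source, name):
    """Return (qualified_name, kind, lineno) for every def called `name`."""
    results = []

    def visit(node, class_name=None):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                # Defs found directly below are methods of this class.
                visit(child, class_name=child.name)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                if child.name == name:
                    kind = "method" if class_name else "function"
                    qualified = f"{class_name}.{name}" if class_name else name
                    results.append((qualified, kind, child.lineno))
                # Defs nested inside a function are treated as plain functions.
                visit(child, class_name=None)
            else:
                visit(child, class_name)

    visit(ast.parse(source))
    return results


if __name__ == "__main__":
    for qualified, kind, lineno in find_definitions(SOURCE, "run"):
        print(f"line {lineno}: {kind} {qualified}")
```

If both a function and a method show up, expect a select_function query on that name to change both unless you narrow it down with a filter such as in_class.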

Renaming Methods

Another common refactoring use-case is renaming a method. In this case, two queries can do the job. Notice that the first one looks for a function inside the class FooClass, not a method. That is due to the discrepancy above, but it does the job all the same.

# examples/02-rename-method.py
from bowler import Query


def main():
    (
        Query()
        .select_function("run")
        .in_class("FooClass")
        .rename("increment")
        .execute()
    )
    (
        Query()
        .select_method("run")
        .is_call()
        .rename("increment")
        .execute()
    )

Execute bowler run examples/02-rename-method.py src tests to see the results.

--- ./src/classes.py
+++ ./src/classes.py
@@ -7,6 +7,6 @@
     def __init__(self, value):
         self.value = value

-    def run(self):
+    def increment(self):
         logger.info(f"FooClass.run<value={self.value}>")
         return self.value + 1
--- ./src/functions.py
+++ ./src/functions.py
@@ -10,4 +10,4 @@

     foo = FooClass(value=val2)

-    return val1 + foo.run()
+    return val1 + foo.increment()
--- ./tests/test_classes.py
+++ ./tests/test_classes.py
@@ -13,7 +13,7 @@
 def test_run(caplog):
     foo = FooClass(value=1)

-    assert foo.run() == 2
+    assert foo.increment() == 2

     assert caplog.record_tuples == [
         ("src.classes", logging.INFO, "FooClass.run<value=1>")

Notice that the first Query renames the FooClass.run method definition to FooClass.increment, and the second one changes the invocations on class instances from foo.run to foo.increment.

A real-world use-case for this could be DEP-201 - a Django Enhancement Proposal that may require refactoring a lot of url() calls into re_path() calls. With a little bit of creativity, it’s not hard to come up with this:

# examples/03-django-url.py
from bowler import Query


def main():
    (
        Query()
        .select_function("url")
        .is_filename(include="urls.py")
        .rename("re_path")
        .execute()
    )

Run bowler run examples/03-django-url.py django_example; the django_example directory contains this urls.py:

# django_example/urls.py
from django.urls import url

from foo import views

urlpatterns = [
    url(r"^all/$", views.all_view, name="all"),
    url(r"^update/$", views.update_view, name="update"),
    url(r"^mark/$", views.mark_view, name="mark"),
    url(r"^mark-all/$", views.mark_all_view, name="mark_all"),
    url(r"^delete/$", views.delete_view, name="delete"),
    url(r"^redirect/(?P<obj_id>[\d]+)/$", views.redirect_view, name="redirect"),
]

Observe some changes:

--- ./django_example/urls.py
+++ ./django_example/urls.py
@@ -1,12 +1,12 @@
-from django.urls import url
+from django.urls import re_path

 from foo import views

 urlpatterns = [
-    url(r"^all/$", views.all_view, name="all"),
-    url(r"^update/$", views.update_view, name="update"),
-    url(r"^mark/$", views.mark_view, name="mark"),
-    url(r"^mark-all/$", views.mark_all_view, name="mark_all"),
-    url(r"^delete/$", views.delete_view, name="delete"),
-    url(r"^redirect/(?P<obj_id>[\d]+)/$", views.redirect_view, name="redirect"),
+    re_path(r"^all/$", views.all_view, name="all"),
+    re_path(r"^update/$", views.update_view, name="update"),
+    re_path(r"^mark/$", views.mark_view, name="mark"),
+    re_path(r"^mark-all/$", views.mark_all_view, name="mark_all"),
+    re_path(r"^delete/$", views.delete_view, name="delete"),
+    re_path(r"^redirect/(?P<obj_id>[\d]+)/$", views.redirect_view, name="redirect"),
 ]

If you have a real Django project, you could try bowler run examples/03-django-url.py path/to/your/project.

Using Complex Selector Patterns

Sometimes a chunk of code has to be removed, and it varies so much that a regex-based approach becomes difficult. A situation I found myself in recently was having to remove a high double-digit number of asserts over pytest’s caplog.record_tuples, similar to this one:

assert caplog.record_tuples == [
    ("some.module", logging.INFO, "informational message"),
]

The issue here is that the expected part of the assert varies massively from file to file and from test function to test function. The module, the logging level and the message itself all differ, which makes it challenging to write a regex for search and replace. Here is where Pybowler was helpful beyond a simple find and replace.
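To illustrate why a syntax tree copes with this where a regex struggles, here is a minimal sketch using the standard library’s ast module rather than Bowler. It locates such asserts by their shape (an assert whose comparison starts with caplog.record_tuples) and ignores whatever the expected list contains. The helper names and the sample test are made up for this illustration.

```python
import ast

# Sample test module (assumption, modelled on the tests in this post).
TEST_SOURCE = '''\
def test_run(caplog):
    assert run(1, 1) == 3

    assert caplog.record_tuples == [
        ("src.functions", logging.INFO, "run"),
        ("src.classes", logging.INFO, "FooClass.run<value=1>"),
    ]
'''


def is_record_tuples_assert(node):
    """True for `assert caplog.record_tuples <op> ...`, whatever follows."""
    if not isinstance(node, ast.Assert) or not isinstance(node.test, ast.Compare):
        return False
    left = node.test.left
    return (
        isinstance(left, ast.Attribute)
        and left.attr == "record_tuples"
        and isinstance(left.value, ast.Name)
        and left.value.id == "caplog"
    )


def find_record_tuples_asserts(source):
    """Return (first_line, last_line) for every matching assert statement."""
    tree = ast.parse(source)
    return [
        (node.lineno, node.end_lineno)  # end_lineno needs Python 3.8+
        for node in ast.walk(tree)
        if is_record_tuples_assert(node)
    ]


if __name__ == "__main__":
    for first, last in find_record_tuples_asserts(TEST_SOURCE):
        print(f"caplog assert spans lines {first}-{last}")
```

The match depends only on the left-hand side of the comparison, so one-line and multi-line asserts are found alike. The lib2to3 pattern developed in the rest of this section expresses the same idea in Bowler’s own terms.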

To be able to execute this change, I first had to write a lib2to3 pattern. The documentation [3] does not explain well how to assemble one, and it wasn’t obvious how to proceed past the examples. However, the announcement talk had some insight. So here’s the pattern-writing process I followed:

First, run bowler dump tests/test_classes.py. If you are trying to do this for your own case, change the path to the file that contains the piece of code you want to change.

Next, scroll through the printed syntax tree and look out for the parts representing the code to be changed, in this example - the assert statement.

.  .  .  .  [assert_stmt] '\n    '
.  .  .  .  .  [NAME] '\n    ' 'assert'
.  .  .  .  .  [comparison] ' '
.  .  .  .  .  .  [power] ' '
.  .  .  .  .  .  .  [NAME] ' ' 'caplog'
.  .  .  .  .  .  .  [trailer] ''
.  .  .  .  .  .  .  .  [DOT] '' '.'
.  .  .  .  .  .  .  .  [NAME] '' 'record_tuples'
.  .  .  .  .  .  [EQEQUAL] ' ' '=='
.  .  .  .  .  .  [atom] ' '
.  .  .  .  .  .  .  [LSQB] ' ' '['
.  .  .  .  .  .  .  [atom] '\n        '
.  .  .  .  .  .  .  .  [LPAR] '\n        ' '('
.  .  .  .  .  .  .  .  [testlist_gexp] ''
.  .  .  .  .  .  .  .  .  [STRING] '' '"src.classes"'
.  .  .  .  .  .  .  .  .  [COMMA] '' ','
.  .  .  .  .  .  .  .  .  [power] ' '
.  .  .  .  .  .  .  .  .  .  [NAME] ' ' 'logging'
.  .  .  .  .  .  .  .  .  .  [trailer] ''
.  .  .  .  .  .  .  .  .  .  .  [DOT] '' '.'
.  .  .  .  .  .  .  .  .  .  .  [NAME] '' 'INFO'
.  .  .  .  .  .  .  .  .  [COMMA] '' ','
.  .  .  .  .  .  .  .  .  [STRING] ' ' '"FooClass.run<value=1>"'
.  .  .  .  .  .  .  .  [RPAR] '' ')'
.  .  .  .  .  .  .  [RSQB] '\n    ' ']'
.  .  .  .  [NEWLINE] '' '\n'

The above syntax tree represents the following lines of code:

    assert caplog.record_tuples == [
        ("src.classes", logging.INFO, "FooClass.run<value=1>")
    ]

The first thing to note is that the above piece of the syntax tree starts with an [assert_stmt] node that has NAME and comparison child nodes. NAME has the value assert, and comparison has more children. For the time being, don’t worry about the sub-children and use any* to match the subtree. It leads to a pattern like this:

# examples/04-match-asserts.py
from bowler import Query

PATTERN = """\
assert_stmt< "assert"
  comparison< any * >
>
"""


def main():
    (Query().select(PATTERN).dump())

Run bowler run examples/04-match-asserts.py tests to see this pattern in action and notice it prints subtrees for all assert statements. Just one example:

./tests/test_classes.py
[assert_stmt] '\n    '
.  [NAME] '\n    ' 'assert'
.  [comparison] ' '
.  .  [power] ' '
.  .  .  [NAME] ' ' 'foo'
.  .  .  [trailer] ''
.  .  .  .  [DOT] '' '.'
.  .  .  .  [NAME] '' 'value'
.  .  [EQEQUAL] ' ' '=='
.  .  [NUMBER] ' ' '1'

To select only the assert caplog.record_tuples statements, the pattern has to be narrowed down to focus on the children of the comparison node. Looking at the tree output, there is a power node containing "caplog" and a trailer node containing "." "record_tuples". The finished pattern takes shape:

# examples/05-remove-code.py
from bowler import Query

PATTERN = """\
assert_stmt< "assert"
  comparison<
    power< "caplog"
      trailer< "." "record_tuples" any* >
    >
    any*
  >
>
"""


def remove_statement(node, capture, filename):
    node.remove()


def main():
    (Query().select(PATTERN).modify(remove_statement).idiff())

Using this pattern would return the asserts we’re targeting. The final part of the puzzle is to write a modifier [4] to remove that piece of code. Since the pattern matches the whole assert statement, that is the node we want to remove in the modifier. Hence node.remove() in remove_statement above.

Running bowler run examples/05-remove-code.py tests should show it in action.

--- ./tests/test_classes.py
+++ ./tests/test_classes.py
@@ -7,7 +7,7 @@
     foo = FooClass(value=1)

     assert foo.value == 1
-    assert caplog.record_tuples == []

 def test_run(caplog):
--- ./tests/test_classes.py
+++ ./tests/test_classes.py
@@ -15,6 +15,3 @@

     assert foo.run() == 2
-
-    assert caplog.record_tuples == [
-        ("src.classes", logging.INFO, "FooClass.run<value=1>")
-    ]

--- ./tests/test_functions.py
+++ ./tests/test_functions.py
@@ -6,7 +6,3 @@
 def test_run(caplog):
     assert run(1, 1) == 3
-
-    assert caplog.record_tuples == [
-        ("src.functions", logging.INFO, "run"),
-        ("src.classes", logging.INFO, "FooClass.run<value=1>"),
-    ]

The thing to note is that modifiers do not return anything and have to remove/replace/update the node(s) in place. This modifier is elementary, but looking at the existing modifier implementations (e.g. add_argument_transform) should give some hints on how to write more complex ones.

If it is not entirely clear what is happening exactly, I recommend printing or using ipdb to inspect the node instance, as well as capture inside remove_statement.
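For contrast, here is a rough sketch of the same removal done with the standard library’s ast.NodeTransformer instead of Bowler. There, a visitor drops a statement by returning None rather than by calling node.remove(), and the regenerated source loses comments and the original formatting. That loss is one reason to reach for tools that keep the concrete syntax tree. The class and function names are hypothetical, and ast.unparse requires Python 3.9 or newer.

```python
import ast

# Sample test module (assumption, modelled on the tests in this post).
SOURCE = '''\
def test_run(caplog):
    # this comment will not survive the round trip
    assert run(1, 1) == 3

    assert caplog.record_tuples == [
        ("src.functions", logging.INFO, "run"),
    ]
'''


class RemoveRecordTuplesAsserts(ast.NodeTransformer):
    """Drop every `assert caplog.record_tuples <op> ...` statement."""

    def visit_Assert(self, node):
        test = node.test
        if (
            isinstance(test, ast.Compare)
            and isinstance(test.left, ast.Attribute)
            and test.left.attr == "record_tuples"
            and isinstance(test.left.value, ast.Name)
            and test.left.value.id == "caplog"
        ):
            return None  # returning None removes the statement from its parent
        return node


def strip_asserts(source):
    """Return regenerated source without the caplog asserts.

    Caveat: if the assert were the only statement in a function body,
    the result would have an empty body and no longer be valid Python.
    """
    tree = RemoveRecordTuplesAsserts().visit(ast.parse(source))
    return ast.unparse(tree)


if __name__ == "__main__":
    print(strip_asserts(SOURCE))
```

The printed result keeps assert run(1, 1) == 3 but drops the comment and the blank line, whereas the Bowler diff above leaves everything around the removed statement untouched.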

Final Thoughts

Pybowler is built on top of fissix, a backport of lib2to3, and lib2to3 itself has been deprecated and was removed from the standard library in Python 3.13. Are there any alternatives to Pybowler? Unsurprisingly, there are. One of them is RedBaron - a community-driven open source project for writing code to modify code - whose documentation may be slightly more beginner-friendly. A similar project, coming out of Instagram, is LibCST and its codemods, with quite friendly documentation too. If you are working on a Django project, you may be interested in keeping an eye on django-codemod - its goal is to provide LibCST codemods to help upgrade Django.

Having now used Pybowler in a couple of instances, and having found the alternatives only later, I want to share a few observations:

  1. In a relatively small codebase (~55K lines), a regex search and replace using codemod [1] has covered the vast majority of use-cases so far.

  2. If another opportunity pops up that requires similar code modifications, I may look at RedBaron or LibCST. These two have extended documentation, more examples, and are actively developed.

  3. My impression is that this could be very powerful when it comes to transforming large blocks of code. An example I can come up with would be switching a framework or transforming some duplicated code to use an abstraction. Still, I imagine these have prerequisites to be effective, e.g. a style guide that supports such changes and experience in writing them, to name a few.

I hope this article and the accompanying pybowler-example-repo repository, a simple hands-on introduction, inspire you to consider automated refactoring as an option when search and replace does not cut it.


[Footnotes]

  1. Facebook’s codemod: a tool/library to assist you with large-scale codebase refactors that can be partially automated but still require human oversight and occasional intervention.

  2. Bowler: Safe code refactoring for modern Python

  3. Pattern Syntax:

    Selector patterns follow a very simple syntax, as defined in the lib2to3 pattern grammar. Matching elements of the Python grammar is done by listing the grammar element, optionally followed by angle brackets containing nested match expressions. The any keyword can be used to match grammar elements, regardless of their type, while * denotes elements that repeat zero or more times. Make sure to include necessary string literal tokens when using nested expressions, and any* to match remaining grammar elements.

  4. Modifiers:

    Modifiers in Bowler are functions that modify, add, remove, or replace syntax tree elements originally matched by selectors after elements have passed all filters. Modifications may occur anywhere in the syntax tree, either above or below the matched element, and may include multiple modifications.