Large Scale Refactoring With PyBowler
I have been intrigued by sophisticated refactoring tools for a while, but never found an opportunity that would warrant the time investment in learning one, since the changes were always small. Lately, I had to remove some old code. While some IDEs (like PyCharm) have basic refactoring tools, I am a creature of habit and have stuck with Spacemacs. For that reason, my primary refactoring Swiss Army knife has long been Facebook’s codemod [1] paired with a regular expression of some proportion. In this particular case, however, writing such a regex seemed extremely difficult.
Cue Pybowler [2] - a Python tool for safe code refactoring (announcement talk). It seriously lacks good examples, but after a few failed attempts, I found it very helpful. This post provides example code for the situations I encountered.
If you want to follow along, clone the pybowler-example-repo GitHub repository. The example code is elementary:
# src/classes.py
class FooClass:
    def __init__(self, value):
        self.value = value

    def run(self):
        logger.info(f"FooClass.run<value={self.value}>")
        return self.value + 1


# src/functions.py
def run(val1, val2):
    logger.info("run")
    foo = FooClass(value=val2)
    return val1 + foo.run()
There is a class FooClass with a method run, a separate function named run, plus some tests for both. I hope that having so little code makes it easier to understand how things work. If you have the example repository set up, let’s dive straight in.
Changing Arguments
One of the simplest things that Pybowler enables out of the box is adding/removing arguments.
Every refactoring using Pybowler starts by building a Query, followed by fluent-interface calls to selector(s) (e.g., select_function) and filter(s) that find and narrow down the parts of the code to be changed by modifier(s) (e.g., add_argument), ending with a final execute(). There are a few more pieces to this that I am leaving out of this post, but you can read more in the documentation.
# examples/01-add-arguments.py
from bowler import Query


def main():
    (
        Query()
        .select_function("run")
        .add_argument(
            "auto_param", '"default_value"', positional=True,
        )
        .execute()
    )
Run bowler run examples/01-add-arguments.py src tests to see it in action:
--- ./src/classes.py
+++ ./src/classes.py
@@ -7,6 +7,6 @@
     def __init__(self, value):
         self.value = value
-    def run(self):
+    def run(self, auto_param):
         logger.info(f"FooClass.run<value={self.value}>")
         return self.value + 1
--- ./src/functions.py
+++ ./src/functions.py
@@ -5,7 +5,7 @@
 logger = logging.getLogger(__name__)
-def run(val1, val2):
+def run(val1, val2, auto_param):
     logger.info("run")
     foo = FooClass(value=val2)
--- ./tests/test_functions.py
+++ ./tests/test_functions.py
@@ -4,7 +4,7 @@
 def test_run(caplog):
-    assert run(1, 1) == 3
+    assert run(1, 1, "default_value") == 3
     assert caplog.record_tuples == [
         ("src.functions", logging.INFO, "run"),
Notice that it alters both the function and the method named run, even though select_function was used. I am not 100% certain whether this discrepancy is expected behaviour or a bug, but either way, it is something to keep an eye out for.
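Since the selector matched both the function and the method here, it can help to list every definition that shares a name before running a query. The following is a minimal sketch that uses Python's standard ast module rather than Bowler. The SOURCE string and the find_definitions helper are hypothetical names made up for illustration, and the sketch only looks at top-level functions and the methods of top-level classes.

```python
import ast

# Trimmed copy of the example code, inlined so the sketch runs on its own.
SOURCE = '''
class FooClass:
    def __init__(self, value):
        self.value = value

    def run(self):
        return self.value + 1


def run(val1, val2):
    foo = FooClass(value=val2)
    return val1 + foo.run()
'''


def find_definitions(source, name):
    """Return (kind, owner, lineno) for each top-level def or method called `name`."""
    found = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef) and node.name == name:
            found.append(("function", None, node.lineno))
        elif isinstance(node, ast.ClassDef):
            # Only direct children of the class body are treated as methods.
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == name:
                    found.append(("method", node.name, item.lineno))
    return found


for kind, owner, lineno in find_definitions(SOURCE, "run"):
    print(kind, owner, lineno)
```

If the list contains more entries than intended, that is a hint to narrow the query further, for example with in_class as in the next section.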
Renaming Methods
Another common use case for refactoring is renaming a method. In this case, two queries can do the job. Notice that the first one looks for a function inside the class FooClass, not a method. This is due to the discrepancy above, but it does the job all the same.
# examples/02-rename-method.py
from bowler import Query


def main():
    (
        Query()
        .select_function("run")
        .in_class("FooClass")
        .rename("increment")
        .execute()
    )
    (
        Query()
        .select_method("run")
        .is_call()
        .rename("increment")
        .execute()
    )
Execute bowler run examples/02-rename-method.py src tests to see the results:
--- ./src/classes.py
+++ ./src/classes.py
@@ -7,6 +7,6 @@
     def __init__(self, value):
         self.value = value
-    def run(self):
+    def increment(self):
         logger.info(f"FooClass.run<value={self.value}>")
         return self.value + 1
--- ./src/functions.py
+++ ./src/functions.py
@@ -10,4 +10,4 @@
     foo = FooClass(value=val2)
-    return val1 + foo.run()
+    return val1 + foo.increment()
--- ./tests/test_classes.py
+++ ./tests/test_classes.py
@@ -13,7 +13,7 @@
 def test_run(caplog):
     foo = FooClass(value=1)
-    assert foo.run() == 2
+    assert foo.increment() == 2
     assert caplog.record_tuples == [
         ("src.classes", logging.INFO, "FooClass.run<value=1>")
Notice the first Query renames the FooClass.run method definition to FooClass.increment, and the second one changes the invocations on class instances from foo.run to foo.increment.
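The second query selects calls by method name. It seems safest to assume that it cannot tell which class the receiver belongs to, so reviewing the call sites first may be worthwhile. Below is a minimal sketch using the standard ast module, not Bowler, that lists the receiver of every call to a given method name. SAMPLE, the thread object in it, and method_call_sites are hypothetical and exist only for illustration.

```python
import ast

# Hypothetical snippet: one call on a FooClass instance, one on an unrelated object.
SAMPLE = '''
foo = FooClass(value=1)
assert foo.run() == 2
thread.run()
'''


def method_call_sites(source, method_name):
    """Return sorted (lineno, receiver source text) for each `<receiver>.method_name(...)` call."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == method_name
        ):
            # get_source_segment returns the exact text of the receiver expression.
            receiver = ast.get_source_segment(source, node.func.value)
            sites.append((node.lineno, receiver))
    return sorted(sites)


for lineno, receiver in method_call_sites(SAMPLE, "run"):
    print(f"line {lineno}: {receiver}.run()")
```

Any receiver in the list that is not a FooClass instance is worth a closer look in the resulting diff.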
A real-world use case for this could be DEP-201, a Django Enhancement Proposal that may require refactoring a lot of url() calls into re_path() calls. With a little bit of creativity, it’s not hard to come up with this:
from bowler import Query


def main():
    (
        Query()
        .select_function("url")
        .is_filename(include="urls.py")
        .rename("re_path")
        .execute()
    )
Run bowler run examples/03-django-url.py django_example, where django_example contains this urls.py:
# django_example/urls.py
from django.urls import url
from foo import views

urlpatterns = [
    url(r"^all/$", views.all_view, name="all"),
    url(r"^update/$", views.update_view, name="update"),
    url(r"^mark/$", views.mark_view, name="mark"),
    url(r"^mark-all/$", views.mark_all_view, name="mark_all"),
    url(r"^delete/$", views.delete_view, name="delete"),
    url(r"^redirect/(?P<obj_id>[\d]+)/$", views.redirect_view, name="redirect"),
]
Observe some changes:
--- ./django_example/urls.py
+++ ./django_example/urls.py
@@ -1,12 +1,12 @@
-from django.urls import url
+from django.urls import re_path
 from foo import views
 urlpatterns = [
-    url(r"^all/$", views.all_view, name="all"),
-    url(r"^update/$", views.update_view, name="update"),
-    url(r"^mark/$", views.mark_view, name="mark"),
-    url(r"^mark-all/$", views.mark_all_view, name="mark_all"),
-    url(r"^delete/$", views.delete_view, name="delete"),
-    url(r"^redirect/(?P<obj_id>[\d]+)/$", views.redirect_view, name="redirect"),
+    re_path(r"^all/$", views.all_view, name="all"),
+    re_path(r"^update/$", views.update_view, name="update"),
+    re_path(r"^mark/$", views.mark_view, name="mark"),
+    re_path(r"^mark-all/$", views.mark_all_view, name="mark_all"),
+    re_path(r"^delete/$", views.delete_view, name="delete"),
+    re_path(r"^redirect/(?P<obj_id>[\d]+)/$", views.redirect_view, name="redirect"),
 ]
If you have a real Django project, you could try bowler run examples/03-django-url.py path/to/your/project.
Using Complex Selector Patterns
Sometimes a chunk of code has to be removed, and it varies so much that a regex-based approach becomes difficult. I recently found myself needing to remove a high double-digit number of asserts over pytest's caplog.record_tuples, similar to this one:
assert caplog.record_tuples == [
    ("some.module", logging.INFO, "informational message"),
]
The issue here is that the expected part of the assert varies massively from file to file and from test function to test function. The module, the logging level, and the message itself all differ, which makes it challenging to write a regex for search and replace. Here is where Pybowler was helpful beyond a simple find and replace.
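To illustrate why matching on the syntax tree sidesteps the varying right-hand side, here is a minimal sketch that finds such asserts with the standard ast module instead of Bowler. It only reads code and changes nothing, so it can also serve as a rough count of how many statements a later removal should touch. SAMPLE and both helper names are hypothetical.

```python
import ast

# Hypothetical test module with one unrelated assert and two targeted ones.
SAMPLE = '''
import logging

def test_init(caplog):
    foo = 1
    assert foo == 1
    assert caplog.record_tuples == []

def test_run(caplog):
    assert caplog.record_tuples == [
        ("src.classes", logging.INFO, "FooClass.run<value=1>")
    ]
'''


def is_record_tuples_assert(node):
    """True for `assert caplog.record_tuples <op> ...`, whatever follows the operator."""
    if not isinstance(node, ast.Assert) or not isinstance(node.test, ast.Compare):
        return False
    left = node.test.left
    return (
        isinstance(left, ast.Attribute)
        and left.attr == "record_tuples"
        and isinstance(left.value, ast.Name)
        and left.value.id == "caplog"
    )


def find_record_tuples_asserts(source):
    """Return the sorted line numbers of all matching assert statements."""
    tree = ast.parse(source)
    return sorted(n.lineno for n in ast.walk(tree) if is_record_tuples_assert(n))


print(find_record_tuples_asserts(SAMPLE))
```

The check only inspects the left-hand side of the comparison, which is the same idea the selector pattern developed below relies on.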
To be able to execute this change, I first had to write a lib2to3 pattern. The documentation [3] does not explain well how to assemble it and it wasn’t obvious how to proceed past the examples. However, the announcement talk had some insight. So here’s the example pattern writing process I followed:
First, run bowler dump tests/test_classes.py. If you are trying this for your own case, change the path to the file that contains the piece of code you want to change.
Next, scroll through the printed syntax tree and look out for the parts representing the code to be changed, in this example - the assert statement.
. . . . [assert_stmt] '\n '
. . . . . [NAME] '\n ' 'assert'
. . . . . [comparison] ' '
. . . . . . [power] ' '
. . . . . . . [NAME] ' ' 'caplog'
. . . . . . . [trailer] ''
. . . . . . . . [DOT] '' '.'
. . . . . . . . [NAME] '' 'record_tuples'
. . . . . . [EQEQUAL] ' ' '=='
. . . . . . [atom] ' '
. . . . . . . [LSQB] ' ' '['
. . . . . . . [atom] '\n '
. . . . . . . . [LPAR] '\n ' '('
. . . . . . . . [testlist_gexp] ''
. . . . . . . . . [STRING] '' '"src.classes"'
. . . . . . . . . [COMMA] '' ','
. . . . . . . . . [power] ' '
. . . . . . . . . . [NAME] ' ' 'logging'
. . . . . . . . . . [trailer] ''
. . . . . . . . . . . [DOT] '' '.'
. . . . . . . . . . . [NAME] '' 'INFO'
. . . . . . . . . [COMMA] '' ','
. . . . . . . . . [STRING] ' ' '"FooClass.run<value=1>"'
. . . . . . . . [RPAR] '' ')'
. . . . . . . [RSQB] '\n ' ']'
. . . . [NEWLINE] '' '\n'
The above syntax tree represents the following line of code:
assert caplog.record_tuples == [
    ("src.classes", logging.INFO, "FooClass.run<value=1>")
]
The first thing to note is that the above piece of the syntax tree starts with an [assert_stmt] node that has NAME and comparison child nodes. NAME has the value assert, and comparison has more children. For the time being, ignore the sub-children and use any* to match the subtree. This leads to a pattern like this:
# examples/04-match-asserts.py
from bowler import Query

PATTERN = """\
assert_stmt< "assert"
    comparison< any * >
>
"""


def main():
    (Query().select(PATTERN).dump())
Run bowler run examples/04-match-asserts.py tests to see this pattern in action and notice it prints subtrees for all assert statements. Just one example:
./tests/test_classes.py
[assert_stmt] '\n '
. [NAME] '\n ' 'assert'
. [comparison] ' '
. . [power] ' '
. . . [NAME] ' ' 'foo'
. . . [trailer] ''
. . . . [DOT] '' '.'
. . . . [NAME] '' 'value'
. . [EQEQUAL] ' ' '=='
. . [NUMBER] ' ' '1'
To select only the assert caplog.record_tuples statements, the pattern has to be narrowed down to focus on the children of the comparison node. Looking at the tree output, there is a power node containing "caplog" and a trailer node containing "." "record_tuples". The finished pattern takes shape:
# examples/05-remove-code.py
from bowler import Query

PATTERN = """\
assert_stmt< "assert"
    comparison<
        power< "caplog"
            trailer< "." "record_tuples" any* >
        >
        any*
    >
>
"""


def remove_statement(node, capture, filename):
    node.remove()


def main():
    (Query().select(PATTERN).modify(remove_statement).idiff())
Using this pattern returns the asserts we’re targeting. The final part of the puzzle is to write a modifier [4] to remove that piece of code. Since the pattern matches the assert statement, it is the node we want to remove in the modifier; hence the node.remove() in remove_statement above.
Running bowler run examples/05-remove-code.py tests should show it in action:
--- ./tests/test_classes.py
+++ ./tests/test_classes.py
@@ -7,7 +7,7 @@
     foo = FooClass(value=1)
     assert foo.value == 1
-    assert caplog.record_tuples == []
 def test_run(caplog):
--- ./tests/test_classes.py
+++ ./tests/test_classes.py
@@ -15,6 +15,3 @@
     assert foo.run() == 2
-
-    assert caplog.record_tuples == [
-        ("src.classes", logging.INFO, "FooClass.run<value=1>")
-    ]
--- ./tests/test_functions.py
+++ ./tests/test_functions.py
@@ -6,7 +6,3 @@
 def test_run(caplog):
     assert run(1, 1) == 3
-
-    assert caplog.record_tuples == [
-        ("src.functions", logging.INFO, "run"),
-        ("src.classes", logging.INFO, "FooClass.run<value=1>"),
-    ]
The thing to note is that modifiers do not return anything and have to remove, replace, or update the node(s) in place. This modifier is elementary, but looking at the existing modifier implementations (e.g. add_argument_transform) should give some hints on how to write more complex ones.
If it is not entirely clear what is happening exactly, I recommend printing or using ipdb to inspect the node instance, as well as capture, inside remove_statement.
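As a starting point for that kind of inspection, here is a minimal sketch of a modifier that only prints what it receives and changes nothing. It assumes the (node, capture, filename) signature used by remove_statement above, that capture behaves like a dict, and that str() on a node gives back the source text it covers. The StandInNode class is a hypothetical placeholder so the sketch runs without Bowler installed.

```python
def inspect_match(node, capture, filename):
    """Modifier that only prints what was matched and leaves the tree untouched."""
    print(f"{filename}: matched {type(node).__name__}")
    # Assumption: str() on a syntax tree node returns the source it covers.
    print(str(node).strip())
    for key, value in capture.items():
        print(f"  capture[{key!r}] -> {str(value).strip()!r}")


# With Bowler installed, this would take the place of remove_statement:
#   Query().select(PATTERN).modify(inspect_match).idiff()


class StandInNode:
    """Hypothetical placeholder for a matched node, not a real Bowler object."""

    def __str__(self):
        return "\n    assert caplog.record_tuples == []"


inspect_match(StandInNode(), {"example": StandInNode()}, "tests/test_classes.py")
```

Because nothing is modified, the resulting diff should be empty, which makes this safe to run repeatedly while refining a pattern.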
Final Thoughts
Pybowler is built on top of fissix, a lib2to3 backport, and lib2to3 is expected to be deprecated by Python 3.12. Are there any alternatives to Pybowler? Unsurprisingly, there are. One of them is RedBaron, a community-driven open source project for writing code to modify code; its documentation may be slightly more beginner-friendly. A similar project, coming out of Instagram, is LibCST and its codemods, with quite friendly documentation too. If you are working on a Django project, you may be interested in keeping an eye on django-codemod, whose goal is to provide LibCST codemods to help upgrade Django.
Having now used Pybowler a couple of times and found the alternatives only later, I want to share a few observations:
In a relatively small codebase (~55K lines), a regex search and replace using codemod [1] covered the vast majority of use cases so far.
If another opportunity pops up that requires similar code modifications, I may look at RedBaron or LibCST. These two have more extensive documentation, more examples, and are actively developed.
My impression is that this approach could be very powerful when it comes to transforming large blocks of code. An example I can come up with would be switching a framework or transforming some duplicated code to use an abstraction. Still, I imagine these have prerequisites to be effective, e.g. a style guide that supports such changes and experience in writing them, to name a few.
I hope this article and the accompanying pybowler-example-repo repository, with its simple hands-on introduction to automated refactoring, inspire you to consider it as an option when search and replace does not cut it.
[Footnotes]
[1] Facebook’s codemod: a tool/library to assist you with large-scale codebase refactors that can be partially automated but still require human oversight and occasional intervention.
[3] Selector patterns follow a very simple syntax, as defined in the lib2to3 pattern grammar. Matching elements of the Python grammar is done by listing the grammar element, optionally followed by angle brackets containing nested match expressions. The any keyword can be used to match grammar elements, regardless of their type, while * denotes elements that repeat zero or more times. Make sure to include necessary string literal tokens when using nested expressions, and any* to match remaining grammar elements.
[4] Modifiers in Bowler are functions that modify, add, remove, or replace syntax tree elements originally matched by selectors after elements have passed all filters. Modifications may occur anywhere in the syntax tree, either above or below the matched element, and may include multiple modifications.