Python Generators and Comprehension¶
Digging into generators and comprehensions, from the basics through to implementation, in a comprehensive tutorial. This is a walkthrough for beginners that builds up to real-world examples.
Note
This is an in-progress draft.
import numpy as np
import string
Here we will build a dictionary of items that we can use for examples. Let’s make it keyed on alphabetical characters with random integer values.
The algorithm for this will be:
Do something 20 times so we can have 20 (key, value) items. For each of the 20 iterations, choose a random lowercase alphabetical character as the key and a positive integer up to 100 as the value.
alpha = list(string.ascii_lowercase)
collection = {
np.random.choice(alpha): np.random.randint(100)
for _ in range(20)
}
collection
{'i': 15,
'u': 14,
'h': 74,
'x': 93,
'r': 0,
'd': 64,
'v': 31,
'k': 17,
'm': 93,
'p': 18,
'l': 80,
'o': 31,
'c': 48,
'q': 45,
'b': 55,
's': 40,
't': 53}
We did this through a method called dictionary comprehension, where we build a dictionary object on the fly. You can spot these comprehension methods by iteration code within {} or [] for dictionary or list comprehension, respectively.
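To make the bracket distinction concrete, here is a minimal side-by-side sketch (the nums list is just illustrative):

```python
nums = [1, 2, 3]
squares_list = [n ** 2 for n in nums]      # [] builds a list
squares_dict = {n: n ** 2 for n in nums}   # {} with key: value builds a dict
print(squares_list)   # [1, 4, 9]
print(squares_dict)   # {1: 1, 2: 4, 3: 9}
```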
Building our dictionary collection, we iterate over range(20) so we will have 20 key: value pairs. Since we are not using the number yielded from the range function, we use the conventional Python variable name _ to indicate to the reader that we are not utilizing this variable; our range is simply letting us do something 20 times.
For each iteration over range, we randomly sample alpha using numpy.random.choice. alpha contains the English-language alphabet in lower case by way of string.ascii_lowercase, and we call list on this because numpy.random.choice expects a one-dimensional sequence of choices; a plain string would be treated as a single scalar value rather than a sequence of characters to sample from.
Next, our value is assigned by randomly selecting an integer up to 100 with numpy.random.randint.
In order to tell our comprehension these are key: value pairs, our key (numpy.random.choice(alpha)) comes first, followed by :, then our value (numpy.random.randint(100)). This gives us our key: value. These will all be collected within {} and assigned to the variable collection, which we can call so that the __repr__ method of our collection object (a dict) returns a string representation.
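Incidentally, the output above has only 17 entries even though we looped 20 times: when numpy.random.choice draws a letter that is already a key, the new pair overwrites the old one, since dictionary keys are unique. A small sketch of the same build (the seed value is arbitrary, purely for reproducibility):

```python
import string
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is reproducible
alpha = list(string.ascii_lowercase)
collection = {
    np.random.choice(alpha): np.random.randint(100)
    for _ in range(20)
}
# duplicate keys collapse, so we end up with at most 20 entries
print(len(collection))
```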
Now let’s demonstrate a few things we can do with this collection
dictionary.
from collections import Counter
value_counts = Counter(collection.values())
value_counts.most_common()
[(93, 2),
(31, 2),
(15, 1),
(14, 1),
(74, 1),
(0, 1),
(64, 1),
(17, 1),
(18, 1),
(80, 1),
(48, 1),
(45, 1),
(55, 1),
(40, 1),
(53, 1)]
collections.Counter allows us to feed it an iterable of data and have it tabulate occurrences. We call the most_common method to sort the counts descending by occurrence. We could have also called most_common(5) to get the top 5, for example.
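As a quick sketch of that count argument, using a throwaway string of letters:

```python
from collections import Counter

letter_counts = Counter('abracadabra')   # tallies each character
print(letter_counts.most_common(1))      # [('a', 5)]
```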
This is really doing something like the following.
value_counts = {}
for val in collection.values():
value_counts[val] = value_counts.get(val, 0) + 1
sorted(value_counts.items(), key=lambda count: count[1], reverse=True)
[(93, 2),
(31, 2),
(15, 1),
(14, 1),
(74, 1),
(0, 1),
(64, 1),
(17, 1),
(18, 1),
(80, 1),
(48, 1),
(45, 1),
(55, 1),
(40, 1),
(53, 1)]
First we set up an empty dictionary in which we will tabulate our value occurrences. We then iterate over the values via the collection.values() view. For each value, we assign value_counts a key of that value and increment its count by 1 for each observation. To do this, we get the current count by calling value_counts.get(val). If we instead indexed with value_counts[val] and the key did not yet exist, we would get a KeyError, so we use a default value of zero by calling value_counts.get(val, 0). Then we can take the actual count, or the default starting point of 0, and add 1 for this observation.
Next, to get things sorted like collections.Counter.most_common, we sort our list using sorted. We iterate over key: value pairs, and we tell it that the sorting key is the value: count represents the (key, value) tuple, and we use the value by setting the key as count[1], which is the value position of our iteration tuple. lambda lets us define an inline function, and we could do any sort of operation in it. Maybe this value is an error and we need to square it: lambda count: count[1] ** 2. However, we would be better off doing that in a separate operation, since it hides from the reader that we are sorting on the squared error and not the original value. Finally, reverse=True tells sorted we want max -> min.
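The "separate operation" suggested above might look like the following sketch, where the squared values are hypothetical stand-ins for an error metric:

```python
counts = {93: 2, 31: 2, 15: 1}  # hypothetical value -> occurrence counts

# compute the squared metric in its own, clearly named step
squared_counts = {val: cnt ** 2 for val, cnt in counts.items()}

# then sort on that metric, max -> min, with nothing hidden in the key
ordered = sorted(squared_counts.items(), key=lambda pair: pair[1], reverse=True)
print(ordered)  # [(93, 4), (31, 4), (15, 1)]
```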
So we have taken our random collection of alphabetical keys and tabulated the most common integer value occurrences. We could have easily done the same for the alphabetical keys by calling collections.Counter(collection.keys()).most_common().
Now let’s do some conditional selection. We will first find all keys whose value is greater than 40.
gt40keys = [
k for (k, v) in collection.items()
if v > 40
]
gt40keys
['h', 'x', 'd', 'm', 'l', 'c', 'q', 'b', 't']
That used list comprehension (the list analogue of the dictionary comprehension we used to generate collection) to iterate through the key: value pairs of collection and collect the key if the value is greater than 40. The result is a list of keys.
We could have returned the full key: value pair with (k, v) for (k, v) in collection.items() ..., or we could have made a new dictionary of only those key: value pairs matching that condition. Let’s do that, because it is a common routine.
gt40collection = {
k: v for (k, v) in collection.items()
if v > 40
}
gt40collection
{'h': 74,
'x': 93,
'd': 64,
'm': 93,
'l': 80,
'c': 48,
'q': 45,
'b': 55,
't': 53}
This looks just like our previous list comprehension, but to generate a dictionary we use {} to make it a dictionary comprehension. We assign key k the value v with k: v as we normally do, but only if the value is greater than 40. The result is a dictionary that is a subset of collection with only the entries matching our criteria.
But what if we want to search for a value? Let’s say we want to create a new dictionary that is a subset of collection where the value equals 93. We will do this just like above.
equals93collection = {
k: v for (k, v) in collection.items()
if v == 93
}
equals93collection
{'x': 93, 'm': 93}
So we have 2 key: value pairs where the value met the condition we set. The result is a dictionary.
Before we go on, as a bit of an aside, what if we simply wanted to double-check how many values are equal to 93? Nothing more, nothing less. We can combine a few of the methods we’ve highlighted to get something like the following.
number_values_equal_to_93 = sum(
1 for v in collection.values()
if v == 93
)
number_values_equal_to_93
2
This might look slightly different, but it’s really the same as what we have done. Let’s start inside and work out. First we iterate over the values of collection, and since we only need the values, not the key: value pairs, we use collection.values() (if we were looking at keys we would use collection.keys(), but we would never be counting key occurrences with an equality test because, remember, we cannot have two of the same key in a dictionary). We test whether the value equals 93 and accumulate the integer 1 rather than the value itself. We use 1 because we want to count the occurrences, not sum the values. Finally, we count up these ones with sum.
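The same count can be read straight out of collections.Counter, which agrees with the sum approach; a sketch on a throwaway list of values:

```python
from collections import Counter

values = [15, 93, 31, 93, 40]  # stand-in for collection.values()
by_sum = sum(1 for v in values if v == 93)
by_counter = Counter(values)[93]
print(by_sum, by_counter)  # 2 2
```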
Okay, I just wanted to highlight searching with comprehension and how we can use sum and specific values. Let’s move on.
While this was a “search” for values that equaled 93, we simply iterated over the entire collection to test each and every value. While this is common, another more “search”-like operation is to identify key: value entries where we know a condition is met, and we know how many such entries exist. This could be finding the single user whose hash equals a specific value.
Let’s first use our collection to find the two keys whose value equals 93. Since we know there exist only two such entries, let’s stop as soon as we have found our two data points. There’s no reason we should search further than that; if 93 occurs at positions 0 and 1, why should we search through the remainder of collection?
found = []
for (k, v) in collection.items():
    if v == 93:
        found.append(k)
    if len(found) == 2:
        # stop as soon as both known matches are found
        break
found
['x', 'm']
As you can see, this is just like the previous method where we accumulated keys whose value equaled 93. The syntax is different, but the operation is very similar. The difference here is that we stop iterating over the items as soon as we find our known count of 2 matches.
However, there are some issues here if we were working with much larger data objects. If we materialized the full list of key: value pairs up front, say with list(collection.items()), we would put all of those pairs in memory before we even started iterating. (In Python 3, collection.items() itself returns a lightweight view, but the point stands for any approach that builds the complete pairing first.) While it’s better in that we stop iterating as soon as our 2 are found, we would rather only produce one pair at a time.
Let’s make this better by only creating one key: value pair, check it, then move on if we need to do so.
found = []
for (i, (k, v)) in enumerate(collection.items()):
if v == 93:
print(f'Found a 93 on loop {i=} for {k=}')
found.append(k)
if len(found) == 2:
break
else:
continue
print(f'collection is {len(collection)=} and we searched through {i=} key: value pairs')
found
Found a 93 on loop i=3 for k='x'
Found a 93 on loop i=8 for k='m'
collection is len(collection)=17 and we searched through i=8 key: value pairs
['x', 'm']
So you can see this is almost identical to our last attempt, but we have wrapped collection.items() in enumerate, which gives us a lazy iterator. We then iterate this as it yields an integer telling us the loop number (like a counter) along with the (key, value) tuple. So we assign each yield as (i, (k, v)), check if the value v equals 93, and if so, print out which loop count we are at and the key k it was associated with, then append the key to our found list. If we have found 2, as len(found) == 2, then we break out of our loop; otherwise we continue to the next item.
You can then see that of our collection of length 17, we only searched through 9 pairs (enumerate started at 0 and ended at i=8, so that is 8 + 1 pairs). Pretty cool, especially if we are searching through billions of entries, or our search criteria is a lot more complex than an identity test. What if we were searching for the key where some costly function f(x), which took several {seconds/minutes/MBs/GBs} to compute, yielded the resultant value v of our (k, v) pair?
Another important feature is that we only hold the current item in RAM; we are not inherently accumulating everything we have already iterated over. We are looking at things one at a time and moving on. It’s like we are at a market searching for a good fruit. We sample many fruits by taking one, looking it over, and either keeping it because it matches our criteria for ripeness, or putting it back. We do not stash in our arms every fruit we sample, only to put them down after exhausting the entire bin, keeping our selections. For shopping the market, iterating by yielding one fruit at a time, our arms (and the farmer) are thankful. For our programming, our RAM capacity is thankful.
We see this commonly when operating on file objects, where we can readily encounter files >> the size of our usable memory. This is also extremely common with large arrays and networks when using a GPU.
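As a sketch of that file case: an open file object is itself a lazy iterator that yields one line at a time, so we can stop at the first match without reading the whole file into memory (the file contents here are fabricated for the example):

```python
import os
import tempfile

# write a throwaway file: one needle followed by lots of hay
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, 'w') as fh:
    fh.write('needle\n')
    fh.write('hay\n' * 1000)

# iterating the file handle yields lines lazily; next() stops at the first hit
with open(path) as fh:
    first_match = next(line.strip() for line in fh if line.strip() == 'needle')
print(first_match)  # needle

os.remove(path)
```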
Let’s make a quick generator just to see what’s really different here. First, though, let’s make a version like our list comprehension to see what that looks like, and then do the same thing with a generator to compare.
def list_building(n=10):
    """Build and return a complete list of n integers"""
    created_array = []
    for i in range(n):
        created_array.append(i)
    return created_array

def generator_list_building(n=10):
    """Yield n integers one at a time"""
    for i in range(n):
        yield i
complete_list = list_building(10)
print(f'{complete_list=} is {type(complete_list)=}')
iterable_list = generator_list_building(10)
print(f'{iterable_list=} is {type(iterable_list)=}')
complete_list=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] is type(complete_list)=<class 'list'>
iterable_list=<generator object generator_list_building at 0x7f8f315e9f90> is type(iterable_list)=<class 'generator'>
As you can see both by method and by inspection, the first list-building routine creates, then returns, the entire array of 10 integers. However, the generator method creates a generator object which can later be iterated over. You can see the difference by printing out the result: complete_list contains a complete 10-integer array, while iterable_list holds no data and is instead a generator object.
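We can pull values from such a generator one at a time with next, or drain whatever remains with list; a quick sketch using the same generator_list_building idea:

```python
def generator_list_building(n=10):
    for i in range(n):
        yield i

gen = generator_list_building(5)
first = next(gen)           # pulls a single yielded value
second = next(gen)
rest = list(gen)            # list() drains everything still unyielded
print(first, second, rest)  # 0 1 [2, 3, 4]
```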
But how do we get something out of the generator? Let’s say for our list of objects, we wanted an array of squared values.
squared_list = [
n ** 2 for n in complete_list
]
print(squared_list)
squared_generator_list = [
n ** 2 for n in iterable_list
]
print(squared_generator_list)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Both resulted in our list of squared original values. However, with the first method, what do we have in memory? We have the original list complete_list, which is of length 10. We now also have a squared list squared_list, also of length 10. These are small, but what if we simply wanted our squared list and it was to be a billion integers? Further, do we need the original if we just want squares? We could use del complete_list, but we still had both complete lists in memory at some point for some duration.
The second, though, started only with a generator. When we were done, we only had a resultant list and an exhausted generator. That’s because we told it what it will yield, we then iterated it, having it yield a particular integer, we operated on that item and accumulated it, and we then threw away the yielded item before moving on. So as we iterate over the generator, we yield and then move on. Just like our fruit inspection at the market.
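That exhausted-generator claim is easy to verify: once fully iterated, a generator yields nothing further. A minimal sketch:

```python
iterable_list = (i for i in range(10))
squared = [n ** 2 for n in iterable_list]   # consumes every yielded value
leftover = list(iterable_list)              # nothing remains to yield
print(squared)   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
print(leftover)  # []
```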
The great thing is, our squaring operation looks the same! Nice and simple, right? But our functions were a bit crude — can we write them in our comprehension style? Can we make a comprehension generator? Yes!
comprehension_based_complete_list = [
i for i in range(10)
]
print(f'{comprehension_based_complete_list=} is {type(comprehension_based_complete_list)=}')
comprehension_based_iterable_list = (
i for i in range(10)
)
print(f'{comprehension_based_iterable_list=} is {type(comprehension_based_iterable_list)=}')
comprehension_based_complete_list=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] is type(comprehension_based_complete_list)=<class 'list'>
comprehension_based_iterable_list=<generator object <genexpr> at 0x7f8f308f4890> is type(comprehension_based_iterable_list)=<class 'generator'>
Okay, there we have it. The same functions we already wrote, but with list and generator comprehension. Remember, this is done through the subtle difference of using either [] for a list or () for a generator around the comprehension logic.
Now, this is not the best example. That’s because in Python 3, range is already a lazy sequence: it produces its values on demand rather than building a list up front (it is not technically a generator, but it shares that memory-friendly behavior). So these would simply be list(range(10)) and range(10), respectively. But you should be able to see that what we are doing to make the data could be anything. Finally, as an exercise, let’s combine everything in the last few functions to make a squared value list and generator. This should be a better example than range alone, and it will allow us to write something more succinct than the multi-step functions above.
comprehension_based_squared_list = [
i ** 2 for i in range(10)
]
print(f'{comprehension_based_squared_list=} is {type(comprehension_based_squared_list)=}')
comprehension_based_squared_iterable = (
i ** 2 for i in range(10)
)
print(f'{comprehension_based_squared_iterable=} is {type(comprehension_based_squared_iterable)=}')
comprehension_based_squared_list=[0, 1, 4, 9, 16, 25, 36, 49, 64, 81] is type(comprehension_based_squared_list)=<class 'list'>
comprehension_based_squared_iterable=<generator object <genexpr> at 0x7f8f308f4ac0> is type(comprehension_based_squared_iterable)=<class 'generator'>
So we have a squared list and a squared generator.
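One practical difference between range and a generator worth seeing: a range can be iterated repeatedly, while a generator is one-shot. A small sketch:

```python
r = range(3)
first_pass = list(r)
second_pass = list(r)    # a range can be walked again

g = (i for i in range(3))
gen_first = list(g)
gen_second = list(g)     # the generator is already exhausted

print(first_pass, second_pass)  # [0, 1, 2] [0, 1, 2]
print(gen_first, gen_second)    # [0, 1, 2] []
```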
Let’s get back to our “searching”. Remember, way back we used enumerate to create an iterator-based version of our data over which we could iterate. Let’s take a look at another way we can do this.
collections_generator = iter(collection.items())
found = []
while len(found) < 2:
k, v = next(collections_generator)
if v == 93:
print(f'Found a 93 for {k=}')
found.append(k)
found
Found a 93 for k='x'
Found a 93 for k='m'
['x', 'm']
Here we created our own iterator using iter, which lets us work through any collection of data (list, array, dict, tuple). We then use a while loop to iterate through the data until we have found our 2 matches, collected under found and assessed with len(found). We yield one key: value pair at a time by calling next on our iterator, check if the value v is equal to 93, and if so, append the key to found.
Great, this works just like before but is more concise using while, and because we wrap our own data object with iter, it is adaptable. The benefit of enumerate is that we don’t have to pull the data ourselves using next, and we also get the added feature of a built-in counter as the first value in the yielded (count, iterable_item) tuple, where our iterable_item is a key: value pair as (k, v) since we are iterating collection.items().
Either works fine, and there are some pros and cons to each beyond what we just discussed. Generally it comes down to habit, and I find myself using both almost equally.
Now, one caveat of our iterator approach: what if we couldn’t find our second match? Either we were not working on immutable data and it changed since we began the search, or our match criteria did not hold the way our prior knowledge suggested. We will exhaust the iterator and hit a StopIteration error. We always want to catch such exceptions, deal with them gracefully, and move on so we do not break our code.
So how do we go about doing this here?
collections_generator = iter(collection.items())
found = []
while len(found) < 2:
try:
k, v = next(collections_generator)
if v == 53:
print(f'Found a 53 for {k=}')
found.append(k)
except StopIteration:
print('We ran out of data to search')
break
found
Found a 53 for k='t'
We ran out of data to search
['t']
To catch this issue, we nest our item assignment and condition check within a try/except routine. Here, we try to get the next item and test it. However, if we run out because there is nothing left to yield, next will raise StopIteration. We catch this specific (and we always want to be specific) exception with except StopIteration.
To ensure we had a scenario where this exception would be thrown, we searched for a value that we know from our previous exercises occurred only once, here 53.
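An alternative to catching StopIteration is to give next a default sentinel to return when the iterator is exhausted; a sketch on a throwaway iterator standing in for our collection:

```python
pairs = iter([('t', 53)])  # stand-in for iter(collection.items())

found = []
while len(found) < 2:
    item = next(pairs, None)   # None is returned instead of raising StopIteration
    if item is None:
        print('We ran out of data to search')
        break
    k, v = item
    if v == 53:
        found.append(k)
print(found)  # ['t']
```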
Pretty simple. But two important things to note.
First, we explicitly use except StopIteration to catch only this error. We do not want to catch any other issues that might arise, as that would obfuscate other errors; we assume running out of items is the only thing that could happen here.
More critically, whatever we do in our exception handling, whether it’s simply printing a warning, like here, or trying something else, we have to call break.
If we do not call break when we run out of items, what will happen with our while loop? Well, we’ll never meet the condition len(found) < 2, and we will loop forever, spinning on an exception that fires on every pass. Think about that happening in production!
Another consideration is that we can use else and finally with our try/except routine. While I’ll leave that for another time, in brief: else runs only if no exception was caught with except, and finally will always run no matter the prior outcome (think closing a file object, etc.). This gives us a way to run a secondary operation on success and a cleanup regardless of whether we succeeded.
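A compact sketch of all four clauses together, with throwaway values:

```python
events = []
numbers = iter([1])
try:
    value = next(numbers)
except StopIteration:
    events.append('except')   # only on an exhausted iterator
else:
    events.append('else')     # only when no exception was raised
finally:
    events.append('finally')  # always runs, success or not
print(events)  # ['else', 'finally']
```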
Okay, I think we hit a lot of good stuff so far. We discussed constructing objects through comprehension building (list or dictionary), and generating some random toy data in doing so. We ran through a few different ways to summarize and understand our data using either built-ins or from scratch methods. We saw how we can “search” conditionally to find data or simply count occurrences. We then dove into using generators and iterating to optimize our searching. Finally, we briefly touched on adding some error handling.
Let’s take a look at one last method. What if we knew our data had only one match, for example, our user with a specific hash? Let’s create a simple class object, User, and highlight two things we can do to wrap up everything we’ve gone over thus far.
class User:
def __init__(self, name, age, active=True):
self.name = name
self.age = age
self.active = active
def toggle_active(self):
self.active = not self.active
return True
def __repr__(self):
return f'<User> {self.name=} | {self.age=} | {self.active}'
Okay, we now have a User class that represents a user with a name, age, and an active status. We also have a method that we can call on a User to toggle the active status. We also set a __repr__ method so we can get a string representation of our users.
Let’s add some users, and we will store them in a list. So let’s do this with some list comprehension!
user_names = ['Patrick', 'Matthew', 'Linux Admin', 'Operating Doctor', 'Data Scientist']
users = [
User(name=name, age=np.random.randint(80))
for name in user_names
]
users
[<User> self.name='Patrick' | self.age=3 | True,
<User> self.name='Matthew' | self.age=45 | True,
<User> self.name='Linux Admin' | self.age=12 | True,
<User> self.name='Operating Doctor' | self.age=46 | True,
<User> self.name='Data Scientist' | self.age=74 | True]
So we used our list comprehension to build out a list of users as users from our preset list of names, user_names. In doing so, we assigned them a random age up to 80 with numpy.random.randint(80), similar to what we did with collection. Also note that since we did not supply an active argument, it defaulted to True as we set in User.__init__. And because we made a nice __repr__, we get an informative string representation of all of our users in users.
Now let’s use what we have done already to perform a few somewhat real world operations.
I see two users whose age is under 18. Patrick and Linux Admin should not be active on our platform without additional parental consent. Let’s go ahead and take care of that. But let’s highlight a few methods to do this. I will comment out all but the first approach and only run that one, since we only want to toggle each status once, but we want to highlight them all.
# Pretty standard approach
for user in users:
    if user.age < 18:
        user.toggle_active()

# # This works, but map is lazy, so to perform the mapped function we
# # have to iterate it - for our simple purpose this is a bit terse
# list(map(lambda user: user.toggle_active(), (user for user in users if user.age < 18)))

# # Since toggle_active returns True as long as no error is thrown,
# # we could add a check that all requested toggles ran successfully
# assert all(user.toggle_active() for user in users if user.age < 18)
users
[<User> self.name='Patrick' | self.age=3 | False,
<User> self.name='Matthew' | self.age=45 | True,
<User> self.name='Linux Admin' | self.age=12 | False,
<User> self.name='Operating Doctor' | self.age=46 | True,
<User> self.name='Data Scientist' | self.age=74 | True]
As you can see, we simply ran a very readable loop to call User.toggle_active() on any user we encounter whose age is less than 18. And we can see that, yes, it did indeed work.
I also commented out a few other approaches using more advanced, or in some cases merely more confusing, methods. The version with an assertion is pretty handy. You’d do something like this if you were unit testing; however, you don’t usually want assertions in your production code. In production, we might instead say if not all(blah): send_alert("status toggle failures") or something like that. And again, we can do that without inspecting the users simply because we return True if the code executes properly. But remember, that’s not checking that the toggle resulted in the correct status state, only that it ran without raising an exception.
Okay, so we have our users set up, and their statuses are now age appropriate. Let’s come back to our techniques. Let’s find users whose age is over 30 (similar to what we just did), but now we want to do a bit more than a simple one-time function execution.
Remember, we can do this with a list or a generator. Let’s pretend that our set of patients should now go into a new class object. Let’s also set the caveat that the number of users we find will be >> the actual 2 in our example. We might also assume that each User in users contains additional attributes, one of which holds a health history that might be megabytes in size. So because we have a few things we want to do with our found users, and each found user is memory intensive and already exists somewhere in the users list, we do not want to copy an entire subset of that list.
We are also going to pretend that our matching patients can only accumulate in our intensive care unit until we fill our capacity. Every bit of memory is important (remember all that patient history we naively read into memory when making our Users?), and because we can only take so many, let’s use our iterator approach.
Note: if we were really memory constrained here and speed was important, we could use indexing and other tricks. But let’s assume we are somewhere between our example of 5 and Google-scale big data.
users_iter = iter(users)
max_capacity = 2
intensive_care_patients = []
while len(intensive_care_patients) < max_capacity:
    try:
        patient = next(users_iter)
        if patient.age > 30:
            intensive_care_patients.append(patient.name)
    except StopIteration:
        # without this break we would loop forever, as discussed above
        print('We still have capacity!')
        break
intensive_care_patients
['Matthew', 'Operating Doctor']
This should all seem standard fare now. So I will leave the breakdown to you at this point.
Notice how we only used the patient name in our resultant list? Again, we are assuming it’s incredibly expensive to hold a User in memory, so let’s not duplicate things unless we have to. In reality, we might reference a database table primary key, or maybe a unique user hash since names are common. We also do not want people causing havoc on the system by guessing actual data references (thus why we normally use hashes or other unpredictable identifiers), but I digress.
Okay, remember what I just said about copying. In reality, we could just reference the original object, so we would have a list of objects and another list of references to those objects. But for our purposes, let’s assume we are dealing with copies above, where perhaps we would otherwise attach additional information, making the reference and the original no longer equivalent and leaving us with two objects for one user.
If we wanted to make ICU patient objects, we could have simply done so above. Rather than appending the User.name to a list, we could have made those objects and collected them. Let’s do that quickly here just as an example.
class VulnerablePatient:
def __init__(self, patient):
self.patient = patient
def __repr__(self):
return f'<VulnerablePatient> {self.patient}'
users_iter = iter(users)
max_capacity = 2
intensive_care_patients = []
while len(intensive_care_patients) < max_capacity:
    try:
        patient = next(users_iter)
        if patient.age > 30:
            intensive_care_patients.append(
                VulnerablePatient(patient)
            )
    except StopIteration:
        print('We still have capacity!')
        break
intensive_care_patients
intensive_care_patients
[<VulnerablePatient> <User> self.name='Matthew' | self.age=45 | True,
<VulnerablePatient> <User> self.name='Operating Doctor' | self.age=46 | True]
Okay, so one thing we did differently here is that we passed the entire User object as the patient to our VulnerablePatient object. I will also point out that, whether or not it is obvious, here we rely on User.__repr__ to provide the actual patient information in VulnerablePatient.__repr__.
Here is where we can check whether we have made two patients, one the standalone User and one a second User inside VulnerablePatient, or whether we are referencing the same User object and thus the same memory block. Let’s check.
check_user = next(
user for user in users
if user.age > 30
)
compare_vulnerable_patient = next(
patient for patient in intensive_care_patients
if patient.patient.name == check_user.name
)
print(check_user, compare_vulnerable_patient)
<User> self.name='Matthew' | self.age=45 | True <VulnerablePatient> <User> self.name='Matthew' | self.age=45 | True
Okay, two things here. First, the simplest: jump down to the last line of code and the output after it. We can see that the user we grabbed as check_user, who met the condition for which we made ICU patients, is properly found and corresponds to the same VulnerablePatient. How we condition that should be familiar now, if it wasn’t before we started.
The second thing to note is that we combine a few of our methods, finally, into a nice and concise example.
We first want to get a user that we know would be a VulnerablePatient by conditioning the same way, here age > 30. We do this with a generator because we now love them, but more importantly, we want to get just the first user that meets the condition. We do not want to generate a full list of matching users because, remember, they are expensive, and we only need one, fast. So we use a generator that will yield matches, and we call next one time so we get the first item that is yielded. You will see we do this as a shortcut by wrapping a generator comprehension in next. In Python 2 you could call the generator’s own .next() method, but in modern Python the built-in next function is the clean and supported way.
You’ll then see we do the same with our ICU patients list, but here we condition on the name matching our single check_user we just found. So we search for a single match (even though we know there are more), and then again we search for a single match (but here we know there should be only one; yet our syntax is the same for our purposes).
You might note we did not wrap these in a try/except to catch a StopIteration, because we know the data exists and each call will successfully return. (We say that all the time in production, and then things blow up, don’t they?)
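If we could not assume a match exists, the safer variant is again next with a default, so a miss returns a sentinel instead of blowing up; a sketch with fabricated users:

```python
users = [('Matthew', 45), ('Patrick', 3)]  # fabricated (name, age) pairs

# no user is over 100, so the generator yields nothing and the default is used
match = next((name for name, age in users if age > 100), None)
print(match)  # None
```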
Okay, back to it. We now have a User and a VulnerablePatient that should be the same person. But is this a copy, or a reference to one memory block? Let’s check.
print(f'{id(check_user)=}, {id(compare_vulnerable_patient)=}, and {id(compare_vulnerable_patient.patient)=})', end='\n\n')
def assert_equivalency(obj1, obj2):
"""Assert the equivalency of two objects"""
try:
assert id(obj1) == id(obj2)
except AssertionError:
print(f'{obj1=} and {obj2=} are not equivalent', end='\n\n')
else:
print(f'{obj1=} and {obj2=} are equivalent', end='\n\n')
assert_equivalency(check_user, compare_vulnerable_patient)
assert_equivalency(check_user, compare_vulnerable_patient.patient)
id(check_user)=140252981884720, id(compare_vulnerable_patient)=140252983924240, and id(compare_vulnerable_patient.patient)=140252981884720)
obj1=<User> self.name='Matthew' | self.age=45 | True and obj2=<VulnerablePatient> <User> self.name='Matthew' | self.age=45 | True are not equivalent
obj1=<User> self.name='Matthew' | self.age=45 | True and obj2=<User> self.name='Matthew' | self.age=45 | True are equivalent
So, a few ways to look at it. First, we simply print the result of the id function, which tells us each object’s unique identifier in memory, and compare by eye. Second, we assert the equivalency of the ids inside assert_equivalency. We could equally check identity directly with the is operator, which compares the same thing.
As you can see, the checks agree. While the User and the VulnerablePatient are not the same object, which is what we expect, the VulnerablePatient.patient and the User are the same. So we did not create a second in-memory patient when we put them in a VulnerablePatient object.
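The same identity-versus-copy distinction in a tiny standalone sketch:

```python
original = ['patient history']
reference = original           # another name for the same object
shallow_copy = list(original)  # a new object with equal contents

print(reference is original)      # True: same memory block
print(shallow_copy is original)   # False: distinct objects
print(shallow_copy == original)   # True: equal contents nonetheless
```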
Okay, this last part was a bit of a tangent from what we were going through up to this point. However, it was a nice culmination of our methods in getting check_user and compare_vulnerable_patient, and hopefully the explanation of in-memory objects and referencing, and how we can check these things, proves helpful.
I think that should do it for generators and generator comprehensions, and when we might want to use them over lists and list comprehensions.