.. Dictionary Lookup - Exploring the Depths

.. post:: Nov 13, 2020
   :tags: atag
   :author: Matthew Martz

Dictionary Lookup - Exploring the Depths
========================================
Exploring methods of performant Python dictionary lookups

.. panels::
    :column: col-lg-12 p-0
    :header: font-weight-bold bg-info

    Note
    ^^^^^^^^^^^^^^
    This is an in-progress draft.

Introduction
------------

Some introduction text here.

The Bread and Butter
--------------------
Let us start with a simple, nested dictionary lookup.

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    Find the value of our nested key
    ^^^^^^^^^^^^^^

    .. code:: python

        data_dict['this']['that']['the_other']

As you likely know, we will raise a KeyError exception if any of the keys within our 
lookup chain are missing. If we want to fallback to a default value, we can wrap this 
to return our default with a failed retrieval.

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    Example Code
    ^^^^^^^^^^^^^^
    .. code:: python

        default = None

        try:
            return data_dict['this']['that']['the_other']
        except KeyError:
            return default


A Simple Example
----------------

Setup the example here before we define things.

``data_dict`` is a dict where we expect in a perfect world we would have
our expected structure and could simply query the deepest nested level
for ``our_key`` and default to ``default_value``. But we know we can't
expect anything either than ``data_dict`` exists and is a dict

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    Three example methods for key lookup
    ^^^^^^^^^^^^^^

    .. code:: python

        data_dict = {}
        default_value = 1.0


        def old_style():
            """Old style (Python < 3.8)"""
            if 'top_level' in data_dict:
                if 'lower_level' in data_dict['top_level']:
                    if 'our_key' in data_dict['top_level']['lower_level']:
                        return data_dict['top_level']['lower_level']['our_key']
            else:
                return default_value


        def walrus_style():
            """Walrus operator cleans this up"""
            if (top := data_dict.get('top_level', None)):
                if (lower := top.get('lower_level', None)):
                    return lower.get('our_key', default_value)
            else:
                return default_value


        def chaining_style():
            """Chaining with object defaults"""
            return (
                data_dict.get("top_level", {})
                .get("lower_level", {})
                .get("our_key", default_value)
            )

Try with an empty ``data_dict`` We should expect the first two to be
faster as they break quicker

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    Timing our methods
    ^^^^^^^^^^^^^^

    .. code:: python

        %timeit old_style()
        %timeit walrus_style()
        %timeit chaining_style()

    .. code:: text

        152 ns ± 0.825 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
        204 ns ± 2.71 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
        366 ns ± 1.66 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


Indeed, if we do not have our initial key, our chained lookup is longer
since it proceeds through multiple lookups. Our walrus method is a
little slower since we make an assignment. So the simple nested hash
lookups will fail faster.

Now let's use a semi-populated ``data_dict`` We should expect some
convergence on timing now that we remove the early fail

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    Failing early in our lookup
    ^^^^^^^^^^^^^^

    .. code:: python

        data_dict = {
            'top_level': {'lower_level': {'not_our_key': -1}}
        }

        %timeit old_style()
        %timeit walrus_style()
        %timeit chaining_style()

    .. code:: text

        303 ns ± 1.31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
        373 ns ± 0.586 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
        359 ns ± 0.543 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Now we do see a bit more convergence on timing since we have to recurse
with all methods. And the first method still wins by using only hash
lookups as opposed to multiple rounds of assignment. What intrigues me
is that the chained assignment is faster than the walrus operator method
(will do some PEP research, this is a good CS exercise).

Now let's try the last scenario when our key is indeed in our structure

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    Finding our nested key
    ^^^^^^^^^^^^^^

    .. code:: python

        data_dict = {'top_level': {'lower_level': {'our_key': -1}}}
        %timeit old_style()
        %timeit walrus_style()
        %timeit chaining_style()

    .. code:: text

        404 ns ± 5.23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
        370 ns ± 1.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
        372 ns ± 1.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Now we get interesting. If we have our key, the first method only using
hash lookups is slower when we will end up getting our key of interest.
We have to identify lookup, then ultimately do it all again to get our
value. However, with walrus and chaining assignment, we are doing a
lookup *and* storing of the value as we go. So when we get to the final
level of recursion we can simply return what we have. We also now see
that the walrus and chaining methods have equalized.

Summarizing Our Toy Data Experiments
------------------------------------

To summarize, continual assignment works nice when we can fairly
confidently expect to recurse through most of our data structure and not
fail immediately. The walrus operator introduced in Python 3.8 is truly
wonderful, and one of my favorite Python 3 additions (I know, you
probably said Async or division). Without further testing, we are left
primarily to a style decision.

While I often use the chaining approach with empty dict structures as
defaults to continue the chain, and personally find it a clean syntax,
the walrus operator approach has (at least) two primary benefits: 1)
while debugging we would have the value of the last successful lookup
regardless of where it failed, and 2) I think it's a bit more
understandable to someone who might not be as familiar with the idea of
chaining an object. As an additional note regarding the first point, we
could obtain the value of our last lookup by assigning it. However, when
we do that, we are simply performing the task the walrus operator was
introduced to provide. Another [+1] for the walrus operator method.

Further, if we would be working with large, messy nesting and
performance was critical -- and we will presume so -- we know from many
nights of banging our heads against the keyboard that we can never
expect a consistent data structure. I know we are all thinking of that
master piece of a JSON dump we get from some third party service we are
required to utilize that contains inconsistent structure and data types.
That one for which we will have to parse at 3am again and again.
Therefore, our method should be performant at any level of lookup
failure. Our chaining method will always get or assign all the way
through the intended structure. If we expect a very large recursion, but
we don't actually have it, we would have to use the time and memory to
unecessarily work through all of that failure. That would leave us with
the nested if/else or the walrus style. So again, another [+1] for the
walrus operator method. Are you seeing the theme?

Moving Beyond Our Toy Data - Let's Make Some Real Improvements
--------------------------------------------------------------

We used some fixed key lookups here. This works okay for a shallow
search depth. However, I would say that anything beyond four keys is a
bit unwieldy. In this case, and likely if we are going to do any depth
of search consistently in a a module, we should create a more
programmatic approach that might step through a list of keys. I'll end
with a very quick example of how I might implement such an approach, but
note, we would want to write some tests around this -- it is merely a
quick starting block.

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    A programmatic and modular method
    ^^^^^^^^^^^^^^

    .. code:: python

        def programmatic_depth_search(data_dict, key_list, default_value=None):
            current_depth_value = data_dict
            while key_list:
                try:
                    lookup_key = key_list.pop(0)
                    if not (
                        current_depth_value := current_depth_value.get(lookup_key, None)
                    ):
                        return default_value
                    else:
                        continue
                except AttributeError:
                    return default_value

            return current_depth_value

Then we might try the following examples:

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    Timing our methods
    ^^^^^^^^^^^^^^

    .. code:: python

        search_value = programmatic_depth_search(
            data_dict={'top_level': {'lower_level': {'our_key': 3}}},
            key_list=['top_level', 'lower_level'],
            default_value='Not Found'
        )
        # Returns: {'our_key': 3}

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    Timing our methods
    ^^^^^^^^^^^^^^

    .. code:: python

        search_value = programmatic_depth_search(
            data_dict={'top_level': {'lower_level': {'our_key': 3}}},
            key_list=['top_level', 'lower_level', 'our_key'],
            default_value='Not Found'
        )
        # Returns: 3

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    Timing our methods
    ^^^^^^^^^^^^^^

    .. code:: python

        search_value = programmatic_depth_search(
            data_dict={'top_level': {'lower_level': {'our_key': 3}}},
            key_list=['top_level', 'lower_level', 'our_key', 'fail_key'],
            default_value='Not Found'
        )
        # Returns: 'Not Found'

Revisiting an Old Friend
------------------------

With the hindsight of writing a testable and modularized version of our
dictionary searcher, let's revisit our first three comparisons. We were
essentially comparing by always assuming we would either find
``our_key`` and return it's value, or it would not be found and we would
return ``default_value``. However, let's try this:

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    Breaking our original functions with unexpected types
    ^^^^^^^^^^^^^^

    .. code:: python

        data_dict = {'top_level': {'lower_level': 1}}
        old_style()
        # Returns: TypeError: argument of type 'int' is not iterable

        walrus_style()
        # Returns: AttributeError: 'int' object has no attribute 'get'

        chaining_style()
        # Returns: AttributeError: 'int' object has no attribute 'get'

As you can clearly see, our fixed methods, while sufficient for our
specific initital example case, they break easily. We have tricked them
into finding a key at a particular level, but it is now an integer and
not a ``dict``. I wanted to highlight this to highlight that in
modularizing our experiment, we also added robustness against the
unexpected -- something we know is key. To prove this, let's now run
this with our modularized version:

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    Robustness with our programmatic approach
    ^^^^^^^^^^^^^^

    .. code:: python

        search_value = programmatic_depth_search(
            data_dict, ['top_level', 'lower_level', 'our_key'], None
        )
        # Returns: None (our default value)

Neat!

What About the Simple Approach
------------------------------

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    A simple, fixed lookup
    ^^^^^^^^^^^^^^

    .. code:: python

        def normal():
            try:
                return data_dict['top_level']['lower_level']['our_key']
            except KeyError:
                return None


.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

   Comparing a fixed lookup to our programmatic approach
    ^^^^^^^^^^^^^^

    .. code:: python

        %timeit normal
        154 ns ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

        %timeit programmatic_depth_search(data_dict, key_list)
        158 ns ± 9.12 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

        key_list = ['top_level', 'lower_level', 'our_key']

        %timeit programmatic_depth_search(data_dict, key_list)
        156 ns ± 22.6 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    Trying with a failure
    ^^^^^^^^^^^^^^

    .. code:: python

        def normal(default_value):
            try:
                return data_dict['top_level']['lower_level']['fail_key']
            except KeyError as e:
                return (e, default_value)

    .. code:: python

        %timeit normal(1)
        528 ns ± 98.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

        key_list = ['top_level', 'lower_level', 'fail_key']

        data_dict = {'top_level': {'lower_level': {'our_key': 3}}}

        %timeit programmatic_depth_search(data_dict, key_list, 1)
        164 ns ± 21.8 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

        key_list = ['top_level', 'fail_key', 'our_level']

        %timeit programmatic_depth_search(data_dict, key_list, 1)
        163 ns ± 30 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

.. panels::
    :column: col-lg-12 p-0
    :header: text-secondary

    Breaking our original functions with unexpected types
    ^^^^^^^^^^^^^^

    .. code:: python

        def normal(default_value):
            try:
                return data_dict['top_level']['fail_key']['our_key']
            except KeyError as e:
                return (e, default_value)

    .. code:: python

        %timeit normal(1)
        390 ns ± 11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

.. update:: November 13, 2020 Updated with some additional content