• Re: Organizing modules and their code

    From 2QdxY4RzWzUUiLuE@potatochowder.com@21:1/5 to transreductionist on Fri Feb 3 17:14:51 2023
    On 2023-02-03 at 13:18:46 -0800,
    transreductionist <transreductionist@gmail.com> wrote:

    Here is the situation. There is a top-level module (see designs below) containing code, that as the name suggests, manages an ETL pipeline. A directory is created called etl_helpers that organizes several modules responsible for making up the pipeline. The discussion concerns the
    Python language, which supports OOP as well as Structural/Functional approaches to programming.

    I am interested in opinions on which design adheres best to standard architectural practices and the SOLID principles. I understand that
    this is one of those topics where people may have strong opinions one
    way or the other. I am interested in those opinions.

    Okay, I'll start: unless one of extract, transform, or load is already,
    or will certainly at some point become, complex/complicated enough to be
    its own architectural module with its own architectural substructure; or
    you're constructing specific ETL pipelines for specific ETL jobs at the
    times the jobs are defined; then I think you're overthinking it.

    Note that I say that speaking as a notorious overthinker. ;-)

    Keep It Simple: Put all four modules at the top level, and run with it
    until you falsify it. Yes, I would give you that same advice no matter
    what language you're using.

    FWIW, I'm not a big fan of OO, but based on what little I know about
    your ETL pipelines, I agree with you that it probably doesn't make a big difference at this level. Define solid (in pretty much any/every sense
    of the word, capitalized or not) interfaces between your modules, and
    write your code against those interfaces, whether OO or any other
    paradigm.
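    A minimal sketch of what "code against interfaces" could look like without any classes, assuming each stage is a plain callable. The names Record, run_pipeline, extract_sample, clean, and load_into are hypothetical and do not come from the thread.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical record type and stage signatures (assumptions for illustration).
Record = dict

Extract = Callable[[], Iterable[Record]]
Transform = Callable[[Iterable[Record]], Iterable[Record]]
Load = Callable[[Iterable[Record]], int]


def run_pipeline(extract: Extract, transform: Transform, load: Load) -> int:
    """Wire the stages together; depends only on the signatures above."""
    return load(transform(extract()))


def extract_sample() -> Iterator[Record]:
    # Stand-in for a real data source.
    yield {"name": " Ada ", "age": "36"}
    yield {"name": "Grace", "age": "45"}


def clean(records: Iterable[Record]) -> Iterator[Record]:
    # Normalise whitespace and convert the age to an integer.
    for record in records:
        yield {"name": record["name"].strip(), "age": int(record["age"])}


def load_into(sink: list) -> Load:
    # Build a loader that appends to a list and reports how many rows it added.
    def load(records: Iterable[Record]) -> int:
        before = len(sink)
        sink.extend(records)
        return len(sink) - before
    return load
```

    Any stage can then be swapped for another callable with the same signature, whether it lives in one module or four.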

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From transreductionist@21:1/5 to All on Fri Feb 3 13:18:46 2023
    Here is the situation. There is a top-level module (see designs below) containing code, that as the name suggests, manages an ETL pipeline. A directory is created called etl_helpers that organizes several modules responsible for making up the pipeline.
    The discussion concerns the Python language, which supports OOP as well as Structural/Functional approaches to programming.

    I am interested in opinions on which design adheres best to standard architectural practices and the SOLID principles. I understand that this is one of those topics where people may have strong opinions one way or the other. I am interested in those
    opinions.

    Allow me to give my thoughts. First, I don't think there would be much difference if I was using OOP for the functionality, or using a structural paradigm. A structural paradigm in my opinion, along the lines of Rich Hickey's comments on simple versus
    complex, would be a simpler implementation. In this case there is no reason to create a construct with state. So let's assume the code is structural and not OOP.

    I would go with Design I. Succinctly stated, Design I supports readability and maintainability at least as well as, if not better than, the other designs. The goal of the SOLID principles is the creation of mid-level software structures that (Software
    Architecture: SA Martin):
    ---- Tolerate change,
    ---- Are easy to understand, and
    ---- Are the basis of components that can be used in many software systems.
    I think Design I best adheres to these principles.

    I could point to the Single Responsibility Principle which is defined as (SA Martin): a module should be responsible to one, and only one, actor. It should satisfy the Liskov Substitution Principle as well. Further, each module in the etl_helpers
    directory is at the same level of abstraction.

    I could also mention that as Dijkstra stressed, at every level, from the smallest function to the largest component, software is like a science and, therefore, is driven by falsifiability. Software architects strive to define modules, components, and
    services that are easily falsifiable (testable). To do so, they employ restrictive disciplines similar to structured programming,
    albeit at a much higher level (SA Martin).

    One can point to multiple reasons why Design I might be preferred, but what are the compelling reasons, if there are any, that would suggest another design is superior?

    Finally, let me reference an interesting research paper I read recently that seems to support the other designs as anti-patterns: Architecture_Anti-patterns_Automatically.pdf

    ---- (https://www.cs.drexel.edu/~yfcai/papers/2019/tse2019.pdf)

    SEVERAL DESIGNS FOR COMPARISON

    DESIGN I:

    ---- manage_the_etl_pipeline.py
    ---- etl_helpers
        ---- extract.py
        ---- transform.py
        ---- load.py

    Of course one could also

    DESIGN II:

    ---- manage_the_etl_pipeline.py
    ---- etl_helpers
        ---- extract_transform_load.py

    or probably even:

    DESIGN III:

    ---- manage_the_etl_pipeline.py
    ---- extract_transform_load.py
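
    As a rough sketch of how the top-level module in Design I might use the helpers, the snippet below builds the layout in a temporary directory and imports from it. The function bodies are placeholders, and the empty __init__.py is an assumption added to make etl_helpers a regular package (on Python 3.3+ the directory would also import without one, as a namespace package).

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build Design I in a throwaway directory; file contents are placeholders.
root = Path(tempfile.mkdtemp())
helpers = root / "etl_helpers"
helpers.mkdir()
(helpers / "__init__.py").write_text("")  # assumption: regular package
(helpers / "extract.py").write_text("def extract():\n    return [1, 2, 3]\n")
(helpers / "transform.py").write_text(
    "def transform(rows):\n    return [r * 10 for r in rows]\n"
)
(helpers / "load.py").write_text("def load(rows):\n    return len(rows)\n")

# Make the temporary project root importable.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# What manage_the_etl_pipeline.py might contain:
from etl_helpers import extract, transform, load

rows_loaded = load.load(transform.transform(extract.extract()))
```

    Each stage is reached through its own module namespace (extract.extract, transform.transform, load.load), which is the separation Design I is meant to provide.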

  • From Thomas Passin@21:1/5 to transreductionist on Fri Feb 3 17:31:26 2023
    On 2/3/2023 4:18 PM, transreductionist wrote:
    Here is the situation. There is a top-level module (see designs below) containing code, that as the name suggests, manages an ETL pipeline. A directory is created called etl_helpers that organizes several modules responsible for making up the pipeline.
    The discussion concerns the Python language, which supports OOP as well as Structural/Functional approaches to programming.

    I am interested in opinions on which design adheres best to standard architectural practices and the SOLID principles. I understand that this is one of those topics where people may have strong opinions one way or the other. I am interested in those
    opinions.

    Well, you have pretty well stacked the deck to make DESIGN 1 the
    obviously preferred choice. I don't think it has much to do with Python
    per se, or even with OO vs imperative style.

    As a practical matter, once you got into working with
    extract_transform_load.py (for the other designs), I would expect that
    you would start wanting to refactor it and eventually end up more like
    DESIGN 1. So you might as well start out that way.

    The reasons are 1) what you said about separation of concerns, 2) a
    desire to keep each module or file relatively coherent and easy to read,
    and 3, as you also suggested, making each of them easier to test.
    Decoupling is important too (one of the SOLID prescriptions), but you
    can violate that with any architecture if you don't think carefully
    about what you are doing.

    On the subject of OO, I think it is a very good approach to think about architecture and design in object terms - meaning conceptual objects
    from the users' point of view. For example, here you have a pipeline (a metaphorical or userland object). It will need functionality to load, transform, and output data so logically it can be composed of a loader,
    one or more transformers, and one or more output formatters (more
    objects). You may also need a scheduler and a configuration manager
    (more objects).

    (*Please* let's not have any quibbling about "class" vs "object". We
    are at a conceptual level here!)

    When it comes to implementation, you can choose to implement those
    userland objects with either imperative, OO, or functional techniques,
    or a mixture.
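
    To illustrate that point, here is a small sketch of the same conceptual objects (a loader, some transformers, an output formatter) implemented once in an OO style and once in a functional style. All names are hypothetical, and integers stand in for real records.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import Callable, List

Rows = List[int]


# OO flavour: the pipeline object holds its parts.
@dataclass
class Pipeline:
    loader: Callable[[], Rows]
    transformers: List[Callable[[Rows], Rows]] = field(default_factory=list)
    formatter: Callable[[Rows], str] = str

    def run(self) -> str:
        rows = self.loader()
        for transformer in self.transformers:
            rows = transformer(rows)
        return self.formatter(rows)


# Functional flavour: the same parts, composed by a plain function.
def run_pipeline(loader, transformers, formatter) -> str:
    rows = reduce(lambda acc, step: step(acc), transformers, loader())
    return formatter(rows)


def load_numbers() -> Rows:
    return [1, 2, 3]


def double(rows: Rows) -> Rows:
    return [r * 2 for r in rows]


def drop_small(rows: Rows) -> Rows:
    return [r for r in rows if r > 2]


def as_csv(rows: Rows) -> str:
    return ",".join(str(r) for r in rows)
```

    Both Pipeline(load_numbers, [double, drop_small], as_csv).run() and run_pipeline(load_numbers, [double, drop_small], as_csv) use the same parts; only the glue differs.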


  • From Stefan Ram@21:1/5 to transreductionist on Fri Feb 3 23:49:49 2023
    transreductionist <transreductionist@gmail.com> writes:
    ---- manage_the_etl_pipeline.py
    ---- etl_helpers
    ---- extract.py
    ---- transform.py
    ---- load.py

    I don't make such de,er,cisions upfront.

    I start out with one file. That would be your
    "manage_the_etl_pipeline.py" I guess. Then, I write everything into that file.
    I would split out a module from this when I'd see need for it.

    The module "__init__.py" of tkinter, for example, has more than
    4500 lines.

  • From Stefan Ram@21:1/5 to Thomas Passin on Sat Feb 4 00:01:01 2023
    Thomas Passin <list1@tompassin.net> writes:
    As a practical matter, once you got into working with extract_transform_load.py (for the other designs), I would expect that
    you would start wanting to refactor it and eventually end up more like
    DESIGN 1. So you might as well start out that way.

    Upfront designs are more possible when someone already has
    experience with similar projects. Then he can take some
    "shortcuts" and look a bit into the future, as you suggest.

    (*Please* let's not have any quibbling about "class" vs
    "object". We are at a conceptual level here!)

    Talking about classes vs. objects /is/ talking on a
    conceptual level.

  • From transreductionist@21:1/5 to Thomas Passin on Fri Feb 3 16:08:36 2023
    One point that I think is worth making, and that I forgot to make, is that namespaces are ubiquitous in Python: Built-in, Global, Function, and Enclosing namespaces, as well as user namespaces, e.g. dictionaries, SimpleNamespace, and dataclasses, to
    list just a few. Modules ARE namespaces. Namespaces organize programming constructs like classes, functions, variables, etc. into coherent groups of "things". To have a namespace that complects extract constructs with transform constructs and load
    constructs in one module seems un-Pythonic.
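
    A short sketch of the "modules are namespaces" point, using the standard math module and types.SimpleNamespace. The extract and transform namespaces and their members are hypothetical stand-ins for real modules.

```python
import math
import types
from types import SimpleNamespace

# A module object is a namespace: its names live in its __dict__.
assert isinstance(math, types.ModuleType)
assert vars(math)["pi"] == math.pi

# A SimpleNamespace groups names in much the same way, without a file.
extract = SimpleNamespace(read_rows=lambda: [1, 2, 3])
transform = SimpleNamespace(double=lambda rows: [r * 2 for r in rows])

# Qualified access keeps the extract names apart from the transform names.
result = transform.double(extract.read_rows())
```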

  • From Thomas Passin@21:1/5 to 2QdxY4RzWzUUiLuE@potatochowder.com on Fri Feb 3 22:24:03 2023
    On 2/3/2023 5:14 PM, 2QdxY4RzWzUUiLuE@potatochowder.com wrote:
    Keep It Simple: Put all four modules at the top level, and run with it
    until you falsify it. Yes, I would give you that same advice no matter
    what language you're using.

    In my recent message I supported DESIGN 1. But I really don't care much
    about the directory organization. It's designing modules whose business
    is to handle various kinds of operations that counts, not so much the
    actual directory organization.

  • From dn@21:1/5 to Thomas Passin on Sat Feb 4 18:24:15 2023
    On 04/02/2023 16.24, Thomas Passin wrote:
    On 2/3/2023 5:14 PM, 2QdxY4RzWzUUiLuE@potatochowder.com wrote:
    Keep It Simple:  Put all four modules at the top level, and run with it
    until you falsify it.  Yes, I would give you that same advice no matter
    what language you're using.

    In my recent message I supported DESIGN 1.  But I really don't care much about the directory organization.  It's designing modules whose business
    is to handle various kinds of operations that counts, not so much the
    actual directory organization.

    +1 (and to comments made in preceding post)

    With ETL the 'reasons to change' (SRP) come from different 'actors'. For example, the data-source may be altered either in format or by changing
    the tool you'll utilise to access it. Hence the virtue of keeping it separate from other parts. If you have multiple data-sources, then each
    should be separate for the same reason.

    The transform is likely dictated by your client's specification. So,
    another separation. Hence Design 1.

    There is a strong argument for suggesting that we're going out of our
    way to imagine problems or future-changes (which may never happen). If
    this is (definitely?) a one-off, then why-bother? If permanence is
    likely, (so many 'temporary' solutions end-up lasting years!) then
    re-use can?should be considered.

    Thus, when it comes to loading the data into your own DB; perhaps this
    should be separate, because it is highly likely that the mechanisms you
    build for loading will be matched by at least one 'someone else' wanting
    to access the same data for the desired end-purposes. Accordingly, a
    shareable module and/or class for that.


    We can't see the code-structure, so some of the other parts of your
    question(s) are too broad. Here's hoping you and Liskov have a good time together...


    My preference is for (what I term) the 'circles' diagram (see copy at https://mahu.rangi.cloud/CraftingSoftware/CleanArchitecture.jpg). This illustrates the 'rule' that code handling the inner functionality should not
    know what happens at the more detailed/lower-level functional level of
    the outer rings.

    With ETL, there's precious little to embody various circles, but the
    content of the outer ring is obvious. The "T" rules comprise the inner
    "Use Case", even if you eschew "Entities" insofar as OOP-avoidance is concerned. This 'inversion', where the inner controls don't need to care
    about the details of outer-ring implementation (is it an RDBMS, MySQL or Postgres; or is it some NoSQL system?) brings to life the "D" of SOLID,
    ie Dependency Inversion.
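
    One way to picture that inversion in Python, as a sketch only: the inner "T" rules depend on a small interface, and the storage detail is passed in from outside. RowStore, apply_transform_rules, and InMemoryStore are hypothetical names, and the rule itself (drop negative amounts, round to two places) is invented for illustration.

```python
from typing import Iterable, List, Protocol


class RowStore(Protocol):
    """What the inner code needs from storage (hypothetical interface)."""

    def save(self, rows: Iterable[dict]) -> None: ...


def apply_transform_rules(rows: Iterable[dict], store: RowStore) -> None:
    """Inner-ring logic: knows the rules and the interface, not the database."""
    cleaned = [
        {**row, "amount": round(row["amount"], 2)}
        for row in rows
        if row["amount"] >= 0
    ]
    store.save(cleaned)


class InMemoryStore:
    """Outer-ring detail; a MySQL, Postgres, or NoSQL adapter would offer the same save()."""

    def __init__(self) -> None:
        self.rows: List[dict] = []

    def save(self, rows: Iterable[dict]) -> None:
        self.rows.extend(rows)
```

    Replacing InMemoryStore with a real database adapter would not require touching apply_transform_rules.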


    You may pick-up some ideas or reassurance from "Making a Simple Data
    Pipeline Part 1: The ETL Pattern" (https://www.codeproject.com/Articles/5324207/Making-a-Simple-Data-Pipeline-Part-1-The-ETL-Patte).

    Let us know how it turns-out...
    --
    Regards,
    =dn

  • From Thomas Passin@21:1/5 to dn via Python-list on Sat Feb 4 01:01:45 2023
    On 2/4/2023 12:24 AM, dn via Python-list wrote:
    The transform is likely dictated by your client's specification. So,
    another separation. Hence Design 1.

    There is a strong argument for suggesting that we're going out of our
    way to imagine problems or future-changes (which may never happen). If
    this is (definitely?) a one-off, then why-bother? If permanence is
    likely, (so many 'temporary' solutions end-up lasting years!) then
    re-use can?should be considered.

    With practice, it gets to be more automatic to set things up from the
    beginning to more-or-less honor separation of concerns, decoupled
    modules and APIs, and so forth. Doing this does not require a full, future-proof suite of alternative database adapters, for example, right
    from the start. On top of everything else, you can't know the future
    perfectly. And you can't know enough at the beginning to get every
    design and architectural path optimal. You learn as you go.

    I have a Tomcat application where I separated the output formatting from
    the calculation of results. At the time I wrote only an XML formatter.
    A decade later, here comes JSON, and customers are asking about it. I
    was able to write a JSON formatter with the same API in about half an
    hour, and now we have optional JSON output. Separating out the
    formatting functionality with its own API was not an example of wasting
    time with YAGNI (You Aren't Going To Need It), it was just plain good
    practice that probably didn't even cost me any more development time -
    since it simplified the calculation code.
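
    The original application was a Tomcat one and its code is not shown here. The following is only a Python sketch of the same idea, with hypothetical names: two formatters share one signature, and the calculation never sees the output format.

```python
import json
import xml.etree.ElementTree as ET
from typing import Callable, Dict

Results = Dict[str, float]


def format_xml(results: Results) -> str:
    # One <result name="..."> element per value.
    root = ET.Element("results")
    for name, value in results.items():
        ET.SubElement(root, "result", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def format_json(results: Results) -> str:
    return json.dumps(results, sort_keys=True)


# Adding a new output format means adding one entry here.
FORMATTERS: Dict[str, Callable[[Results], str]] = {
    "xml": format_xml,
    "json": format_json,
}


def calculate() -> Results:
    # Stand-in for the real calculation.
    return {"mean": 2.5, "total": 10.0}


def render(fmt: str) -> str:
    return FORMATTERS[fmt](calculate())
```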

    OTOH, you may be just trying to learn how to do the bits and pieces. You
    may be learning how to connect to the database. You may be learning how
    to make a pipeline multithreaded. You have to experiment a thousand
    ways, and in a hurry. Until you learn how to do the basic techniques,
    sure, quick and dirty is fine. But it shouldn't be the way you design
    your actual product, unless it's just for you and needs to be done
    quickly, and will probably be ephemeral.

    Why do I get the feeling that the OP was asking about a homework problem?

  • From Weatherby,Gerard@21:1/5 to All on Sat Feb 4 11:17:46 2023
    You’re overthinking it. It doesn’t really matter. Having small chunks of code in separate files can be a hassle when trying to find out what the program does. Having one file with 2,000 lines in it can be a hassle too. This is art / opinion, not science.

  • From transreductionist@21:1/5 to transreductionist on Sat Feb 4 14:18:35 2023
    Thank you for all the helpful replies and consideration. I do hope for other opinions.

    I would rather say it is more like engineering than art. Whether it is a matter of overthinking, or underthinking, is another matter. I enjoyed Dijkstra's letter to his colleagues on the role of scientific thought in computer programming. It is located
    at:

    ---- https://www.cs.utexas.edu/users/EWD/transcriptions/EWD04xx/EWD447.html

    It is my academic training in physics that makes me enjoy picking up an idea and examining it from all sides, and sharing thoughts with friends. Just inquisitive curiosity, and not a homework problem. Thanks for the great link to the ETL site. That was
    a good read. A few years back I built a prod ETL application in Golang using gRPC with a multiprocessing pipeline throughout. It handled GB of data and was fast.

    This analogy came to me the other day. For me, I would rather walk into a grocery store where the bananas, apples, and oranges are separated into their own bins, instead of one common crate.


    On Friday, February 3, 2023 at 4:18:57 PM UTC-5, transreductionist wrote:
    Here is the situation. There is a top-level module (see designs below) containing code, that as the name suggests, manages an ETL pipeline. A directory is created called etl_helpers that organizes several modules responsible for making up the pipeline.
    The discussion concerns the Python language, which supports OOP as well as Structural/Functional approaches to programming.

    I am interested in opinions on which design adheres best to standard architectural practices and the SOLID principles. I understand that this is one of those topics where people may have strong opinions one way or the other. I am interested in those
    opinions.

    Allow me to give my thoughts. First, I don't think there would be much difference if I was using OOP for the functionality, or using a structural paradigm. A structural paradigm in my opinion, along the lines of Rich Hickey's comments on simple versus
    complex, would be a simpler implementation. In this case there is no reason to create a construct with state. So let's assume the code is structural and not OOP.

    I would go with Design I. Succinctly stated, Design I supports readability and maintainability at least as well, if not better than the other designs. The goal of the SOLID principles are the creation of mid-level software structures that (Software
    Architecture: SA Martin). I think Design I best adheres to these principles of:
    ---- Tolerate change,
    ---- Are easy to understand, and
    ---- Are the basis of components that can be used in many software systems.

    I could point to the Single Responsibility Principle which is defined as (SA Martin): a module should be responsible to one, and only one, actor. It should satisfy the Liskov Substitution Principle as well. Further, each module in the etl_helpers
    directory is at the same level of abstraction.

    I could also mention that as Dijkstra stressed, at every level, from the smallest function to the largest component, software is like a science and, therefore, is driven by falsifiability. Software architects strive to define modules, components, and
    services that are easily falsifiable (testable). To do so, they employ restrictive disciplines similar to structured programming,
    albeit at a much higher level (SA Martin).

    One can point to multiple reasons why Design I might be preferred, but what are the compelling reasons, if there are any, that would suggest another design was superior.

    Finally, let me reference an interesting research paper I read recently that seems to support the other designs as anti-patterns: Architecture_Anti-patterns_Automatically.pdf

    ---- (https://www.cs.drexel.edu/~yfcai/papers/2019/tse2019.pdf)

    SEVERAL DESIGNS FOR COMPARISON

    DESIGN I:

    ---- manage_the_etl_pipeline.py
    ---- etl_helpers/
    -------- extract.py
    -------- transform.py
    -------- load.py

    Of course one could also use:

    DESIGN II:

    ---- manage_the_etl_pipeline.py
    ---- etl_helpers/
    -------- extract_transform_load.py

    or probably even:

    DESIGN III:

    ---- manage_the_etl_pipeline.py
    ---- extract_transform_load.py

  • From Greg Ewing@21:1/5 to transreductionist on Sun Feb 5 13:26:53 2023
    On 5/02/23 11:18 am, transreductionist wrote:
    This analogy came to me the other day. For me, I would rather walk into a grocery store where the bananas, apples, and oranges are separated into their own bins, instead of one common crate.

    On the other hand, if the store has an entire aisle devoted to each
    fruit, but only ever one crate of fruit in each aisle, one would think
    they could make better use of their shelf space.

    --
    Greg

  • From Weatherby,Gerard@21:1/5 to transreductionist on Sun Feb 5 15:23:20 2023
    Well, first of all, while there is no doubt as to Dijkstra’s contribution to computer science, I don’t think his description of scientific thought is correct. The acceptance of Einstein’s theory of relativity has nothing to do with internal consistency
    or how easy or difficult it is to explain, but rather with repeated experimental results validating it. Or, more precisely, not disproving it. See Feynman: https://www.youtube.com/watch?v=0KmimDq4cSU


    Engineering is simply maximizing the ratio: benefit / cost. I highly recommend "To Engineer Is Human" by Henry Petroski.

    Regarding the initial question: none of the suggested designs would work because they lack an __init__.py file.

    Once the __init__.py is added, the import statements within it will determine how the API looks. All three designs (I, II, and III) can be implemented with the same API. (I’m pretty sure that’s true. If it’s not, I’d be
    interested in a counterexample).
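
    A minimal sketch of that claim for Designs I and II: each layout is written into a temporary directory, and the __init__.py re-exports extract, transform, and load so that callers see the same names either way. The stage functions are hypothetical stubs that just return their own names. Design III has no etl_helpers package, so it is left out here.

```python
import importlib
import sys
import tempfile
from pathlib import Path

STAGES = ("extract", "transform", "load")


def build_design_one(root: Path) -> None:
    """Design I: one module per stage inside etl_helpers."""
    pkg = root / "etl_helpers"
    pkg.mkdir(parents=True)
    for name in STAGES:
        (pkg / f"{name}.py").write_text(f"def {name}():\n    return '{name}'\n")
    # __init__.py re-exports each function, which defines the public API.
    (pkg / "__init__.py").write_text(
        "".join(f"from .{name} import {name}\n" for name in STAGES)
    )


def build_design_two(root: Path) -> None:
    """Design II: a single combined module inside etl_helpers."""
    pkg = root / "etl_helpers"
    pkg.mkdir(parents=True)
    (pkg / "extract_transform_load.py").write_text(
        "".join(f"def {name}():\n    return '{name}'\n\n" for name in STAGES)
    )
    (pkg / "__init__.py").write_text(
        "from .extract_transform_load import extract, transform, load\n"
    )


def _forget_package() -> None:
    """Drop any cached etl_helpers modules so the next import is fresh."""
    for mod in [m for m in sys.modules if m.split(".")[0] == "etl_helpers"]:
        del sys.modules[mod]


def public_api(builder) -> list[str]:
    """Build a layout in a temp dir, import it, and call its public names."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        builder(root)
        sys.path.insert(0, str(root))
        try:
            _forget_package()
            importlib.invalidate_caches()
            pkg = importlib.import_module("etl_helpers")
            return [getattr(pkg, name)() for name in STAGES]
        finally:
            sys.path.remove(str(root))
            _forget_package()


if __name__ == "__main__":
    print(public_api(build_design_one) == public_api(build_design_two))
```

    In both layouts the caller writes "from etl_helpers import extract" and cannot tell which file the function came from.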





    From: Python-list <python-list-bounces+gweatherby=uchc.edu@python.org> on behalf of transreductionist <transreductionist@gmail.com>
    Date: Saturday, February 4, 2023 at 7:42 PM
    To: python-list@python.org <python-list@python.org>
    Subject: Re: Organizing modules and their code

    Thank you for all the helpful replies and consideration. I do hope for other opinions.

    I would rather say it is more like engineering than art. Whether it is a matter of overthinking or underthinking is another matter. I enjoyed Dijkstra's letter to his colleagues on the role of scientific thought in computer programming. It is located
    at:

    ---- https://www.cs.utexas.edu/users/EWD/transcriptions/EWD04xx/EWD447.html

    It is my academic training in physics that makes me enjoy picking up an idea and examining it from all sides, and sharing thoughts with friends. Just inquisitive curiosity, and not a homework problem. Thanks for the great link to the ETL site. That was
    a good read. A few years back I built a production ETL application in Golang using gRPC with a multiprocessing pipeline throughout. It handled gigabytes of data and was fast.

    This analogy came to me the other day. For me, I would rather walk into a grocery store where the bananas, apples, and oranges are separated into their own bins, instead of one common crate.



  • From Greg Ewing@21:1/5 to Gerard on Mon Feb 6 13:15:06 2023
    On 6/02/23 4:23 am, Weatherby,Gerard wrote:
    Well, first of all, while there is no doubt as to Dijkstra’s contribution to computer science, I don’t think his description of scientific thought is correct. The acceptance of Einstein’s theory of relativity has nothing to do with internal
    consistency or how easy or difficult it is to explain, but rather with repeated experimental results validating it.

    I don't think Dijkstra was claiming that what he was talking about
    was a *complete* description of scientific thought, only that the
    ability to separate out independent concerns is an important part
    of it, and that was something he saw his colleagues failing to do.

    --
    Greg
