
Feature Enhancement Proposal: Wilkinson Formulas and New Model Interfaces #101

tdhoffman opened this issue Jul 26, 2022 · 2 comments

@tdhoffman

Pursuant to the discussion in #95, I've implemented a version of Wilkinson formulas for spatial lag and spatial error models. The code is available on the formula branch of my fork of spreg. spreg.from_formula() parses a Wilkinson formula and returns a fitted OLS, spatial lag, or spatial error model depending on the user's inputs. The function signature is:

spreg.from_formula(formula, df, w=None, method="gm", debug=False, **kwargs)
where:

  • formula is a string following formulaic's syntax plus the additional syntax described below
  • df is a DataFrame or GeoDataFrame
  • w is a libpysal.weights.W object
  • method is a string describing the estimation method (for spatial lag or error models)
  • debug is a boolean which (when true) outputs the parsed formula alongside the configured model
  • **kwargs are additional arguments to be passed to the model class
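A minimal usage sketch, assuming a GeoDataFrame with placeholder columns Y, X1, and X2; the file name and columns are hypothetical, and only from_formula and the signature above come from the formula branch:

```python
import geopandas as gpd
from libpysal.weights import Queen

import spreg  # formula branch of the fork described above

# Hypothetical data: a polygon layer with columns Y, X1, and X2
gdf = gpd.read_file("tracts.gpkg")

# The user builds the weights object first, as the design requires
w = Queen.from_dataframe(gdf)
w.transform = "r"  # row-standardize

# A formula with no spatial operators dispatches to a plain OLS model
ols = spreg.from_formula("Y ~ X1 + X2", gdf, w=w)
print(ols.summary)
```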

The new formula syntax comes in two parts:

  • The <...> operator:
    • Enclose variables (covariates or response) in angle brackets to denote a spatial lag.
    • < and > are not reserved characters in formulaic, so there are no conflicts.
    • Usage: <FIELD> adds the spatial lag of that field from the dataframe to the model matrix.
    • Other transformations can be used within <...>, e.g. <{10*FIELD1} + FIELD2> is parsed correctly.
  • The & symbol:
    • Adds a spatial error component to a model.
    • & is not a reserved character in formulaic, so there are no conflicts.
  • The parser accepts combinations of <...> and &: <FIELD1 + ... + FIELDN> + & is the most general spatial model the syntax can express. However, the dispatcher does not currently dispatch to the combo model classes (a future TODO). See the example formulas below.

Importantly, all terms and operators MUST be space-delimited for the parser to pick up the tokens properly. The current design also requires the user to construct a weights matrix first, which I think makes sense, as the weights functionality is well documented and external to the actual running of the model.
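To make the syntax concrete, here are a few formula strings as I read the description above; the columns Y, X1, and X2 are placeholders, gdf and w are the objects from the earlier sketch, and which model class each string ultimately dispatches to follows this proposal rather than any released spreg API:

```python
# Lag a covariate with <...> (an SLX-style term)
spreg.from_formula("Y ~ X1 + <X2>", gdf, w=w)

# Transformations also work inside <...>
spreg.from_formula("Y ~ <{10*X1} + X2>", gdf, w=w)

# Add a spatial error component with &
spreg.from_formula("Y ~ X1 + X2 + &", gdf, w=w, method="gm")

# Combination of <...> and &: parsed, but not yet dispatched to the
# combo model classes (per the TODO above)
"Y ~ <X1 + X2> + &"
```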

While implementing this, I ran into stumbling blocks in other parts of the spreg API that have led me to work on a redesign of the basic modeling classes and their dependencies. These changes can be found in the api-dev branch of my fork of spreg, where I've been streamlining the user-oriented API to OLS (see prop_ols.py), spatial lag (prop_lag.py), and spatial error (prop_err.py) models. These new interfaces are works in progress and will be described further in a future Feature Enhancement Proposal. However, the new interfaces I've created are not backwards-compatible with current spreg code. Looking ahead, would it make sense to focus on designing a new package with updated spatial regression interfaces, or to create parallel interfaces in spreg to the same estimation code?

@tdhoffman (Author)

As an update to this, I've been cleaning up the formula parser and writing tests for it. I expect to have it fully prepared for review in the next few weeks. Among other things, I've written the code to dispatch to combo models and made it so the default behavior of <FIELD> is to include both FIELD and its lag in the model (so as to correctly specify SLX models). If a user only wants to include the spatial lag of a field, they can manually specify that by writing {w.sparse @ FIELD}, which is what the preprocessor generates for spatial lags.
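For reference, the {w.sparse @ FIELD} transform that the preprocessor generates is just a sparse matrix-vector product; a quick sketch of the equivalence using libpysal's lag_spatial helper (the column X1 is again a placeholder):

```python
import numpy as np
from libpysal.weights import lag_spatial

x = gdf["X1"].to_numpy()  # placeholder column from the earlier sketch

# What the preprocessor emits for a spatial lag: a sparse matrix product
wx_manual = w.sparse @ x

# The same quantity via libpysal's helper
wx_helper = lag_spatial(w, x)

assert np.allclose(wx_manual, wx_helper)
```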

Regarding the API changes, I wanted to elaborate more on some of the issues I've been encountering.

  1. Variable names, shapes, and consistency. In existing classes, variables tend to be named after mathematical symbols from the original papers. This sometimes makes the code hard to follow on its own. While it is certainly simpler for internal variables (those required for estimation) to keep mathematical names, it's often nicer for user-facing variables and class attributes to have more readable names.
  2. Location of estimation code and I/O checks. The current organization has base classes, which contain the estimation code, and user-facing wrapper classes. Moreover, everything in these classes happens in the __init__ methods. In this way, the structure is more akin to Julia's code style (structs with data that get passed to functions) than to an object-oriented code style (classes with data and methods that manipulate their own data). This paradigm does not strictly rely on inheritance, instead preferring to link classes together by instantiating more abstract classes inside the more specific ones.
  3. Internals are hard to access outside of intended workflow. The base class / user class setup can be very confusing when studying the code. If a user is interested in seeing how the code works, the first place they'll go is the user-facing class. But when this class only has input validation and output processing, it quickly becomes difficult to trace the flow of data in the code's execution.

The designs I'm working on in the api-dev branch of my fork aim to remedy these in the following ways.

  1. Standardize the names of user-facing variables across the library. For example, rename betas to params and be explicit in each class' documentation about how the parameters should be parsed. (Perhaps a dict of params would be simpler for models with spatial and nonspatial parameters.) These new names should be readable in plain English, like replacing OLS.u with OLS.residuals.
  2. Redesign user-facing modeling classes to be nearly completely self-contained. The classes should contain all their own estimation code or use inheritance to avoid code duplication (e.g. a base TSLS class could have estimation code which lag models inherit from using super().fit()). Utility functions like input checks and output formatting can also be placed in superclasses so as to encapsulate these often repetitive tasks, which makes the interesting estimation code more readily accessible.
  3. Unify code by model type rather than estimation procedure. To most users, thinking about the estimation procedure comes second to thinking about the model specification. In my opinion, the code should reflect this by (for example) having just one Lag class with a fit() method that accepts an argument which selects the appropriate estimation procedure. This way, the external API is simplified to the names of the models while preserving the level of detail that advanced users want.
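A rough sketch of how points 1-3 might fit together; the class and attribute names (BaseTSLS, Lag, params, residuals) are illustrative only and are not the actual contents of the api-dev branch:

```python
import numpy as np


class BaseTSLS:
    """Shared estimation code lives in a superclass (point 2), so model
    classes reuse it via super().fit() instead of duplicating it."""

    def fit(self, X, y, yend=None):
        # Placeholder estimation step; a real implementation would run
        # two-stage least squares here.
        self.params = {"beta": np.zeros(X.shape[1])}  # point 1: readable names
        self.residuals = y - X @ self.params["beta"]  # e.g. instead of `u`
        return self


class Lag(BaseTSLS):
    """One class per model type (point 3); the estimation procedure is
    selected by an argument to fit() rather than by the class name."""

    def __init__(self, w):
        self.w = w

    def fit(self, X, y, method="gm"):
        if method == "gm":
            # Instrument the spatially lagged response, then reuse the
            # inherited TSLS estimation (sketch only).
            wy = self.w.sparse @ y
            return super().fit(X, y, yend=wy)
        raise NotImplementedError(f"method {method!r} is not sketched here")
```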

The api-dev branch contains the beginnings of these proposals, and I'll expand more on them after I submit the formula features. I'm interested in hearing thoughts about these changes, especially as I lack the historical information about spreg's development cycles that led to the current design.

@pedrovma (Member) commented Aug 1, 2022

Hi Tyler,

I've finally managed to go over your fork and see part of what you are working on. As mentioned in #95, this topic has been under discussion within spreg for quite a while.

There are several complications to applying formulas in our context. For example: the use of spatial lags for both y and X, and how to differentiate them; models that use spatial regimes, and how to define which variables the regimes apply to; the specification of different equations for seemingly unrelated regressions; the combination of both regimes and SUR; and the list goes on, getting more and more complicated. Add to all this the fact that spreg has a standalone GUI called GeoDaSpace.

I look forward to seeing what you will propose. However, anything that is not backwards-compatible is a no-go for spreg. Any additional feature must preserve what is already there, so that it doesn't break everything built on top of spreg.

Additionally, I understand the motivation when you say, for example, that using mathematical symbols makes the code harder to follow. In fact, however, they make it easier for anyone familiar with the literature. One example: if you were to write GM_Lag.residuals, I would have difficulty understanding precisely what you are referring to, since spatial lag (and error) models have two different kinds of residuals. So please do keep these things in mind. IMO, the code should be easier to read for those familiar with the spatial econometrics literature rather than for computer scientists.

Please let me know if you need any support from my side.
