Loading External Reference Data into CDM - FpML Coding Schemes #879

Open
mgratacos opened this issue Dec 3, 2024 · 17 comments
Labels
enhancement New feature or request subject: code generation This issue is about code generation subject: syntax This issue is about the syntax of Rosetta

Comments

@mgratacos

Problem Statement

CDM currently loads FpML coding schemes and compiles them into static enumerated lists, using the Rune DSL docReference annotations.

  • This creates extra work for maintaining and releasing the CDM model.
  • This causes issues for current CDM users, since these FpML coding schemes are updated more frequently than users upgrade their CDM implementations.
  • FpML-based code lists have changed about once a month on average for over a decade (140+ changes over 11 years).
  • This implies 140+ releases of CDM for reference data updates.
  • If multiple major versions of CDM are supported at once, this may imply several times more releases of CDM.
  • Rapidly evolving reference data exacerbates the challenge (e.g. large commodity code lists).
  • Inefficiencies limit overall effectiveness, inhibit interoperability and have more impact as CDM grows in adoption.

Lists in Scope

There are currently 15 CDM enumerations automatically regenerated by CDM on each release based on the FpML coding scheme. All of these should be converted to the new mechanism.

10 CDM enumerated lists look like good candidates to be moved to the new mechanism because the schemes are relatively large (10+ values) and are expected to change relatively frequently, and the values are used infrequently in the code in most cases. However, they require further analysis and discussion.

Proposed Implementation

The proposed implementation changes the Rune syntax by extending the string data type (details subject to refinement based on detailed design and implementation work and feedback from the CDM TAWG and Rune working groups):

  • The existing Rune string type would be enhanced to include a new user-defined validation mechanism.
    • The existing string type supports validated properties (constraints) for minLength, maxLength, and pattern.
    • Two new constraint properties would be added:
      • validationRule would be a string that identifies the name of a user-defined validation rule that would be executed when the string is set, similar to the existing validation constraints. For the purposes of this project, we would always set this to the same value, something like “CodeListValidation”. Different values of this property could be used for other types of validation, such as CRCs/checksums, or database or API lookups.
      • domain would be a string that identifies the list of values to validate against, e.g. “currency-code” or “business-center” or “floating-rate-index”, and could be used by the validation rule to refine its validation.
    • The validation generation logic that currently validates the length and pattern constraints would be enhanced to invoke a validation stub function ValidateString with the parameters validationRule, domain, and the value of the string to be validated.
    • In the Rune runtime, this validation stub function would simply return a warning that the validation function needs to be implemented.
  • We would create a CDM IsValidString user-defined function written in Rune to override the default validation logic from the Rune runtime and apply the appropriate user-defined validation logic, depending on the validationRule parameter. For the CodeListValidation rule, it would validate the string value against the code list retrieved for the supplied domain. For example, this could be a request to validate the code value “XYZ” or “USD” against the “currency-code” code list domain.
  • We would bind the Rune ValidateString function to the CDM IsValidString UDF implementation, so that when the string validation logic is triggered by the generated code and Rune runtime, it will call out to the CDM validation function.
  • For each enumeration being converted to a validated code, we would create a typeAlias including the appropriate constraint properties. For example:
    • typeAlias Currency : string (validationRule: "CodeListValidation", domain: "currency-code")
    • typeAlias BusinessCenter : string (validationRule: "CodeListValidation", domain: "business-center")
  • The IsValidString function would be a single function written in Rune as part of the CDM base functionality that could be used to check against any code list.
  • Existing occurrences of the enumerations would be replaced by the new typeAlias.
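
The stub-and-override binding described in the list above can be sketched as follows. This is a minimal illustration in Python rather than Rune or generated Java, and every name in it (`register_validator`, `validate_string`, `code_list_validation`, the in-memory `_CODE_LISTS` table) is a hypothetical stand-in, not part of Rune or CDM.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry: validationRule name -> implementation.
# A validator takes (domain, value) and returns None when the value is
# valid, or a message describing the problem.
Validator = Callable[[str, str], Optional[str]]
_VALIDATORS: Dict[str, Validator] = {}


def register_validator(rule: str, fn: Validator) -> None:
    """Bind a user-defined implementation to a validationRule name."""
    _VALIDATORS[rule] = fn


def validate_string(validation_rule: str, domain: str, value: str) -> Optional[str]:
    """Stand-in for the ValidateString stub invoked by generated code."""
    fn = _VALIDATORS.get(validation_rule)
    if fn is None:
        # Default behaviour when nothing is bound: warn, do not reject.
        return f"warning: no implementation bound for rule '{validation_rule}'"
    return fn(domain, value)


# Illustrative values only; a real setup would load the FpML-derived lists.
_CODE_LISTS = {"currency-code": {"USD", "EUR", "GBP"}}


def code_list_validation(domain: str, value: str) -> Optional[str]:
    """Assumed shape of the CodeListValidation rule."""
    codes = _CODE_LISTS.get(domain)
    if codes is None:
        return f"unknown domain '{domain}'"
    if value not in codes:
        return f"'{value}' is not a valid code in '{domain}'"
    return None


# The binding step: route the rule name to the user-defined function.
register_validator("CodeListValidation", code_list_validation)
```

With the binding in place, `validate_string("CodeListValidation", "currency-code", "XYZ")` returns an error message, while an unbound rule name only produces the default warning.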
@SimonCockx SimonCockx added enhancement New feature or request subject: code generation This issue is about code generation subject: syntax This issue is about the syntax of Rosetta labels Dec 4, 2024
@plamen-neykov

Some questions:

  • will the CDM project maintain the code lists or would the CDM validation machinery directly use/refer to the fpml ones?
  • If the CDM project is managing the code lists:
    • how will they be versioned? Will every code list be versioned independently, or will versions be maintained for the full set?
    • how will the code lists be packaged and distributed?
    • would there be a set of code lists packaged with the CDM releases?
    • is it desirable to ensure that the user hasn't modified a code list and is always using the released version of it (e.g. preventing the creation of a non-standard model)?
  • would it be possible to provide the Rune (mock-up) code for the proposed IsValidString (does the Rune language have the facilities to work with exogenous files?)

@brianlynn2

  • will the CDM project maintain the code lists or would the CDM validation machinery directly use/refer to the fpml ones?

    • The CDM project will generate the code list files based on FpML lists.
  • If the CDM project is managing the code lists:

    • how will they be versioned? Will every code list be versioned independently, or will versions be maintained for the full set?
      • It’s not clear how this question is relevant for the required functionality. The beauty of the proposed solution is that the design could be evolved over time to accommodate different approaches without changing the Rune syntax. However, at this time the expectation is that rather than creating versioned code lists, there would be a single file for each code list with effective and deprecated dates for each code value within the list. However, during the design process it might be that we revert to the current FpML approach, which is to have each code list versioned and a versioned manifest. If we elect to go that route instead, it will not change the Rune syntax requirements.
    • how will the code lists be packaged and distributed?
      • A version of the code lists will be included in the resources folder of CDM. A mechanism will be provided for end users to refresh that based on a github repository.
    • would there be a set of code lists packaged with the CDM releases?
      • Yes, for convenience.
    • is it desirable to ensure that the user hasn't modified a code list and is always using the released version of it (e.g. preventing the creation of a non-standard model)?
      • No, not really. Occasionally an end user may wish to modify a code list for their own purposes, for instance for testing. Based on experience with FpML, end users have strong incentives to keep the lists in sync with the community and will wish to do so most of the time. (Users wishing to add new codes and to share them with other users will typically go to FpML to get the values vetted.) By providing a mechanism that allows users to easily check against the standard lists, and to update the standard lists quickly and efficiently, that provides sufficient incentive to keep to standard models. There is no need to apply a stick to enforce compliance.
    • would it be possible to provide the Rune (mock-up) code for the proposed IsValidString (does the Rune language have the facilities to work with exogenous files?)
      • The Rune mockup is not yet available, but in pseudo-code it would be something like this:

      • Check the validationRule; if it is “CodeListValidation” do the following:

        • Load the validation list based on the “domain”
        • Attempt to look up the string value being validated in the validation list
        • If absent, return an invalid code message
        • If present, optionally check the “effective” and “deprecated” dates and possibly return a warning if out of range. (Handling of date-dependent logic is complex and this may not be implemented initially.)
        • Return an “ok” indication, e.g. empty string.
      • The Rune language does not yet have a mechanism to work with exogenous files. It’s our intention to use the new data serialization/deserialization mechanism, currently under development, once it’s available. Until then we will write small functions in Java or Python to load the files using standard deserialization mechanisms (e.g. Jackson).
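
The pseudo-code above, together with the interim "small function to load the files" approach, could look roughly like this in Python. The JSON layout (a `codes` array whose entries carry `code`, `effective` and `deprecated`) is an assumption made for illustration, since the file format has not been defined in this thread, and the sample entries and dates are invented.

```python
import json
from datetime import date
from typing import Dict, Optional


def load_code_list(json_text: str) -> Dict[str, dict]:
    """Index the entries of one code list file by code value."""
    doc = json.loads(json_text)
    return {entry["code"]: entry for entry in doc["codes"]}


def _parse(text: Optional[str]) -> Optional[date]:
    """Parse an ISO date string, treating null/missing as 'no date'."""
    return date.fromisoformat(text) if text else None


def is_valid_string(codes: Dict[str, dict], value: str,
                    as_of: Optional[date] = None) -> str:
    """Return "" when the value is acceptable, else a message."""
    entry = codes.get(value)
    if entry is None:
        return f"invalid code: {value}"
    if as_of is not None:  # the date check is optional
        effective = _parse(entry.get("effective"))
        deprecated = _parse(entry.get("deprecated"))
        if effective and as_of < effective:
            return f"warning: {value} not yet effective on {as_of}"
        if deprecated and as_of >= deprecated:
            return f"warning: {value} deprecated since {deprecated}"
    return ""


# Invented sample data in the assumed layout.
SAMPLE = """
{"domain": "currency-code",
 "codes": [
   {"code": "USD", "effective": "1990-01-01", "deprecated": null},
   {"code": "OLD", "effective": "1990-01-01", "deprecated": "2002-01-01"}
 ]}
"""
```

Passing `as_of` explicitly, rather than reading the system clock, leaves the caller free to use a key date from the object being validated.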

@plamen-neykov

However, at this time the expectation is that rather than creating versioned code lists, there would be a single file for each code list with effective and deprecated dates for each code value within the list.

@brianlynn2 - an interesting approach - there are a couple of implications in my view:

  1. the code lists will only be able to grow and never to shrink (e.g. obsolete values will be provided with a deprecation date, but not removed)
  2. the lists will still need a version (or a timestamp, which is anyway a form of versioning) as counterparties would need to know which set of entries are valid for a received CDM object as sender and receiver can be on different versions/timestamps of the codelist. An example would be when the sender has a more recent codelist containing additional entries which are unknown to the receiver.
  3. the valid from and deprecation date would require the CDM library to be time aware and also possibly it would require an asof date setting so that the comparison of the valid entries can be processed.
  4. based on 2. the serialisation should be aware of the versions of the code lists and should include their versions in the header of every serialised object.

@brianlynn2

@plamen-neykov thanks for the questions.

Some feedback on your comments.

  • the code lists will only be able to grow and never to shrink (e.g. obsolete values will be provided with a deprecation date, but not removed)

    • In fact, this is already the practice for FpML code lists, because trades are long lived (some as long as 40-50 years) and even once terminated the trades may be kept in archives. So ISDA policy has been to keep all codes for all time. In the floating rate index scheme, for instance, there are codes that haven't been used for new trading since 2006 or before.
  • the lists will still need a version (or a timestamp, which is anyway a form of versioning) as counterparties would need to know which set of entries are valid for a received CDM object as sender and receiver can be on different versions/timestamps of the codelist. An example would be when the sender has a more recent codelist containing additional entries which are unknown to the receiver.

    • FpML built a sophisticated mechanism to support this. And the industry voted resoundingly to ignore that. Receivers of FpML messages typically don't care what list the sender thought they were using, just what version of the list that they (the receivers) have implemented. And they generally validate all messages against the latest version of the code list. Consequently, the standard industry practice for code lists is for messages to specify (or default to) the standard, versionless code list identifier, which means in effect the most recent version of the code list. (Another reason that codes are never retired). So generally most firms are only testing against a single set of codelists, and never look up old versions of code lists to see if the trades were valid against that version.
    • So the consequence of this is that we don't know what version of the codelists the sender thought they used. Getting the industry to shift to a model where they include the version when they create the messages would be a very difficult slog, particularly because in practice receivers prefer to use their own versions of the lists.
    • So this solution we propose is intended to fit within the existing message flow, and not to attempt to force a major industry change.
  • the valid from and deprecation date would require the CDM library to be time aware and also possibly it would require an asof date setting so that the comparison of the valid entries can be processed.

    • The base date for any date evaluation typically isn't the system date, but rather a key date in the object being validated (e.g. event date) so the CDM library doesn't necessarily need to be date aware. But you are correct that some kind of as-of date setting may be necessary and useful for this. That's something we'll look at in CDM. However, we don't see changing the Rune language for that, at least not at this time, but rather some kind of extension function. Complexities around this point are one of the reasons we might not choose to implement the date checking functionality.
  • based on 2. the serialisation should be aware of the versions of the code lists and should include their versions in the header of every serialised object.

    • See above. Sometimes the clever technical solutions we boffins come up with are ignored in practice. We are going with the assumption that the receiver will decide for themselves which version of a list to apply.

@plamen-neykov

plamen-neykov commented Dec 5, 2024

@brianlynn2

See above. Sometimes the clever technical solutions we boffins come up with are ignored in practice. We are going with the assumption that the receiver will decide for themselves which version of a list to apply.

The version of the code lists used for building the model can be included automatically in the output JSON - this is not something the user has to manage themselves.

Even if this information is ignored at ingestion by the software, the user can still use it to quickly identify the issue and minimize the time required to diagnose why particular messages do not deserialize.

I'm pretty sure that investing the couple of hours needed to include the version(s) of the code lists in the serialised object will save quite a sizeable amount of investigation time across the industry.

Note - the new serialisation already carries the version of the CDM model and it will not be a huge effort to include this additional information.
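
As a rough illustration of the idea in this comment, the sketch below wraps a payload in an envelope that records the code list versions alongside the model version, and shows how a receiver could use that information purely for diagnosis. The envelope field names (`modelVersion`, `codeListVersions`, `data`) and the version strings are hypothetical; they are not taken from the actual Rune serialisation format.

```python
import json
from typing import Dict, List


def serialise_with_code_list_versions(payload: dict, model_version: str,
                                      code_list_versions: Dict[str, str]) -> str:
    """Wrap the payload with (hypothetical) header fields."""
    envelope = {
        "modelVersion": model_version,
        "codeListVersions": dict(sorted(code_list_versions.items())),
        "data": payload,
    }
    return json.dumps(envelope, indent=2)


def diagnose(serialised: str, local_versions: Dict[str, str]) -> List[str]:
    """List the domains where the sender's code list version differs from ours."""
    sent = json.loads(serialised).get("codeListVersions", {})
    return [
        f"{domain}: sender {version}, local {local_versions.get(domain, 'missing')}"
        for domain, version in sent.items()
        if local_versions.get(domain) != version
    ]
```

Validation itself is unaffected here: the receiver still validates against its own lists, and the header is only consulted when a message fails to deserialize.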

@mgratacos
Author

Just to clarify, the scope of the project is not to functionally change the reference data model that is currently consumed from FpML. CDM currently consumes the latest version of the FpML coding schemes available and generates the enumerated lists. For example:
enum BusinessCenterEnum: <"The enumerated values to specify the business centers."> [docReference ISDA FpML_Coding_Scheme schemeLocation "http://www.fpml.org/coding-scheme/business-center"]

We are not planning to change that "latest version" consumption behaviour. Implementers can override the default implementation if they want and use specific versions but that's not the scope of the project.

@brianlynn2

If the implementers of the serialization logic want to store the version of each codelist that was in effect when the trade was created, that's fine by me. Perhaps someday that could be used for a smarter validation. But as Marc says, that's not our project.

@plamen-neykov

plamen-neykov commented Dec 5, 2024

If the implementers of the serialization logic want to store the version of each codelist that was in effect when the trade was created, that's fine by me. Perhaps someday that could be used for a smarter validation. But as Marc says, that's not our project.

@brianlynn2 I agree that including the code list versions in the serialisation output should be implemented by the serialisation developers (and I wasn't implying it should be part of your project).

@SimonCockx
Contributor

SimonCockx commented Dec 17, 2024

Overall I think the idea of supporting custom validation logic on basic types is great. I do have some remarks/ideas about how we could declare it, so below are a few iterations of my thought process.

As an example, I will use currency codes.

Current proposal (completed with the actual function definition)

typeAlias CurrencyCode:
    string(validationRule: "ValidateCodeList", domain: "currency-code")

func ValidateCodeList:  // this function will be implemented in Java
    inputs:
      inputToValidate string (1..1)
      domain string (1..1)
    output:
      isValid boolean (1..1)

Remarks:

  • The validationRule argument refers to the function ValidateCodeList - as such I think it should be represented as an actual reference to a function for proper highlighting (highlighted as a function rather than a string), validation (the function must output a boolean) and scoping (imports work, etc), instead of as a string.
  • The domain argument represents an additional argument to the ValidateCodeList function. This doesn't seem very extensible. What if the function accepts more than one additional argument? What if it accepts none? What if a parameter name coincides with a built-in type parameter such as maxLength?

Iteration 1

typeAlias CurrencyCode:
    string(validationRule: ValidateCodeList(item, "currency-code"))

Note that we are just calling the function directly with the "currency-code" argument. It doesn't have to be a function call though - it can be an arbitrary expression that evaluates to a boolean. In this instance item refers to the string being validated.
Remarks:

  • Until now, type arguments have always represented expressions that could be statically evaluated (just literals in most cases). This proposal would change that, with a significant impact on the DSL parser and type system as a consequence. However, we already have existing syntax that does something similar: conditions!

Iteration 2

typeAlias CurrencyCode:
    string

    condition IsCurrencyCode:
      ValidateCodeList(item, "currency-code")

The good thing is we're reusing an existing, well-thought-out concept, with all of its benefits:

  • We have a name to display in the validation report, which we did not have before. (IsCurrencyCode)
  • A type alias can have multiple conditions, so it's more extensible.
  • There are fewer syntactical changes required, and modellers already know the concept.

Additionally, instead of defining a type alias for every FpML coding scheme, which is over 200, we could use the existing (but not well known) feature of parameterized type aliases:

typeAlias FpMLCodingScheme(domain string):
    string

    condition IsFpMLCodingScheme:
      ValidateCodeList(item, domain)

Note that the domain is now represented as a type parameter, so it can be used directly in a type:

type Foo:
  currencyCode FpMLCodingScheme("currency-code") (1..1)

Or, if modellers prefer to name it more explicitly, they can still reuse the FpMLCodingScheme to define a CurrencyCode type:

typeAlias CurrencyCode:
    FpMLCodingScheme("currency-code")

type Foo:
  currencyCode CurrencyCode (1..1)
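
To make the behaviour of the parameterized alias concrete, here is a small Python analogue. It is only a sketch of the semantics, assuming that a condition is a named boolean check and that a failed condition is reported by name; it says nothing about how Rune or its Java generator would actually implement this, and the code list values are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative values only.
_CODE_LISTS = {"currency-code": {"USD", "EUR"}}


def validate_code_list(item: str, domain: str) -> bool:
    """Analogue of the ValidateCodeList function."""
    return item in _CODE_LISTS.get(domain, set())


@dataclass(frozen=True)
class Condition:
    name: str                     # shown in the validation report
    check: Callable[[str], bool]  # receives the implicit `item`


def fpml_coding_scheme(domain: str) -> List[Condition]:
    """Analogue of typeAlias FpMLCodingScheme(domain string)."""
    return [Condition("IsFpMLCodingScheme",
                      lambda item: validate_code_list(item, domain))]


# Analogue of: typeAlias CurrencyCode: FpMLCodingScheme("currency-code")
currency_code = fpml_coding_scheme("currency-code")


def failed_conditions(value: str, conditions: List[Condition]) -> List[str]:
    """Return the names of all conditions the value does not satisfy."""
    return [c.name for c in conditions if not c.check(value)]
```

Because the domain is captured when the alias is instantiated, one definition serves every coding scheme, and the condition name is available for reporting.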

Open for feedback.

@brianlynn2

Hi @SimonCockx ...

Thanks for these ideas. In general I like the idea of using a "condition" on a string typeAlias if this is possible. We'd considered the possibility but didn't know how to implement it, so proposed what we thought was a simpler solution. Is it possible to do this without a Rune syntax change or enhancement? If so, that could work very well for us.

I agree that directly using the Rune validation function name rather than a string code would be preferable, but wasn't sure how to implement that easily; so we proposed something we thought might be simpler to implement. But from your comment it looks like it can be done using existing Rune syntax. If so, that would be perfect.

Both of your ideas look good. The parameterized typeAlias is particularly clever, while the more explicit representation is also very straightforward. We'll try testing those approaches to see if they work for us.

Of the 200 FpML coding schemes, it looks like only a subset (about 40) are in use by CDM in a way that we would want to map to the new approach, in the short term. But having a solution that could work for the 200 would be super, as CDM grows.

@brianlynn2

@SimonCockx I looked at this in more detail and now I understand that you meant we should add the "condition" capability to a typeAlias, and that it's not already possible to do so. Can you please advise what would be entailed in implementing that? Our current solution is quite simple to implement, and we're not keen to do something more complex. But if adding "condition" to typeAlias would be simple, that might be a good direction to pursue.

@mgratacos
Author

mgratacos commented Dec 19, 2024

@SimonCockx @brianlynn2 We looked at this in a bit more detail. Simon's option 1 is what we (@manel-martos and @arnauoller) are exploring. We are interested in completing the research in changing rune-dsl. However, we see the benefits of Simon's option 2 since we wouldn't need to change the rune-dsl syntax. We don't have a strong preference for either of the two options since functionally they are equivalent.

@SimonCockx
Contributor

SimonCockx commented Dec 19, 2024

Hi @brianlynn2, happy to give a quick overview of what would change if we were to add conditions to type aliases.

  1. A small change to the abstract syntax tree (AST) of Rune. Inside the Rosetta.xcore file, which defines the AST, a conditions property should be added to RosettaTypeAlias.
  2. A small change to the parser of Rune. Inside the Rosetta.xtext file, which defines the parser, the conditions property should be populated with the existing Condition parser rule, e.g., conditions=Condition*.
  3. A generalization of the implicit variable item. Right now, item inside a condition of a type is defined to represent an instance of that type. This should be generalized to behave in the same way for typeAliases. (see ImplicitVariableUtil - and its usage).
  4. (optionally) Make type parameters available inside a condition. (see RosettaScopeProvider)

These are specific to this solution. The rest is common: changes to the type system and changes to the Java generator. Note that there are many things you get for free - function reference resolution (including functions from other namespaces which are imported), support for arbitrary expressions, type checking inside your condition, a condition name to display in the validation report, syntax highlighting, ... - all things which you'll have to think about if you follow a different path. In that sense, I think the string type parameter solution might look deceptively easy at first, but you'll encounter all of these issues quickly once you move forward.

Other than that, because of the central role this project takes in many other projects (CDM/DRR/...), we only accept high quality contributions, so the argument of having an easier implementation does not weigh up to that. Happy to help along the way though, whichever direction we end up going!

@brianlynn2

@SimonCockx ... thanks for the description. It does not sound particularly easy, but we'll look at it.

When you say "we only accept high quality contributions, so the argument of having an easier implementation does not weigh up to that." you sound a bit like a Mafia don rather than someone trying to help. The existing solution for the reference data (using a hard-coded enumeration for lists that often change monthly to quarterly) is an utterly unacceptable solution, not "high quality" in any way in our opinion. Our efforts to get that changed resulted in a complex REGnosys-designed solution to update some of the enumerations automatically at build time, but didn't actually solve the underlying problem.

If we can't make a solution with a syntax change with a reasonable amount of effort, we will replace it with a string and find another way to trigger validation.

@SimonCockx
Contributor

I did not intend to come across like that, nor do I wish to enter a conflict. To be able to collaborate effectively and comfortably on this issue, I would like it if we would be able to find an agreement on the path forward. Perhaps we can plan a chat when everyone is back from their holidays?

@brianlynn2

@SimonCockx thank you for your words. I agree that a face to face discussion early in the New Year would be helpful. Meanwhile Manuel and company at TradeHeader can look into your ideas and see how easily they can implement them.

To give you some context for my frustration:

  • ISDA FpML staff identified the issue that CDM reference data is hard-coded as enumerations c. 2018 or 2019 and warned CDM management of its concerns. At the time CDM had pressing priorities to get initiated and could not address the issue while meeting its other commitments.
  • In 2021, when FpML and CDM were moved under the same ISDA management team, it was identified that this problem remained, and that synchronization between the FpML code lists and the CDM enumerations was not occurring. (For instance, the 2021 floating rate index definitions, a huge ISDA project, the biggest I remember in well over a decade, were not available in CDM for many months after they went into production in the industry.) A solution such as we are developing now was proposed to address this problem. Instead, a technically clever solution was developed to mostly-automatically update the enumerations based on FpML code lists prior to release, partly because CDM designers believed that an enumeration-based solution would give tighter validation and better consistency across the industry. This solved part of the problem - now published versions of CDM included the latest, greatest code lists (at least for certain lists) - but not how to maintain the lists after CDM-based systems entered production, which was the core of the problem.
  • In 2023, after some more ISDA and CDM management changes, this was raised again as an issue and eventually a consensus was reached that something should be done about it. Eventually a CDM task force was convened to assess how best to solve this issue. This team spent several months arguing how to do this and ultimately recommended a solution that included a small Rune syntax enhancement to better support the reference data checking.
  • In early 2024 a call for solutions was made. It was clear that funding for the solution was very limited, and so GEM and TH proposed implementing a solution without a Rune syntax change, on the basis that it would be cheaper to do it that way, and easier to get approvals, and still meet the business requirements. (The idea for how to do that, using conditions on user defined types, came from Minesh at REGnosys). There was some pushback in CDM management to this approach, and despite its very low cost it took most of a year to secure funding for the solution. Meanwhile to address the pushback to our solution, GEM and TH proposed adjusting the solution to include the simplest, easiest to implement Rune enhancement that would address our needs, so that the cost of the proposal could be kept as low as possible. (This solution, slightly enhancing the string type, was also based on an idea from Minesh; we did some more work to make the Rune part of the implementation as small as possible.)
  • Meanwhile during 2024 production users of CDM began discovering that the CDM reference data mechanism would require a CDM version upgrade each time new FpML codes were added, and they were not at all happy about that, to put it mildly. The problem that we'd been identifying for many years finally began happening once CDM was in production, as we expected. Finally once the evidence of the problem became inescapable, there was enough pressure to solve it.

The point of this story is that technical perfection and elegance of a solution is only part of the decision making process. We also need to meet business requirements. And because of the challenges of getting funding for this, one of the requirements we needed to meet is to make the solution relatively affordable.

If we were to create a basic solution using typeAlias to a slightly enhanced string type, as in our initial proposal, it would be trivial to adjust that solution to base the typeAlias on some better syntax, like either of your proposals, and then everything else should continue to work without change. Much more of our work in this project is about other things unrelated to the Rune syntax, like getting valid JSON from FpML lists and loading it and validating against it. So there's nothing about the basic solution that prevents making it better in the future, once there is funding to make a technically better (i.e. more general and all-encompassing) Rune enhancement.

Or perhaps with your assistance Manuel and his colleague can make one of your (I agree, technically superior) solutions work in a reasonable time frame.

I am very supportive of creating the best technical solutions we can to address business problems, as long as those solutions can be within the constraints given to us. Sometimes the "best" technical solutions may not be able to be implemented as soon as we like, due to real world constraints. Like money.

Best regards,
Brian

@manel-martos

We at TradeHeader are focused on completing our learning curve with the Rune upgrade, as we recognize that the complexity inherent in all projects requires time and careful resolution. We are adhering to our planned calendar and aim to deliver technical conclusions to @brianlynn2 before the end of the year. Following the holidays, I will continue to review Simon’s proposals. Our objective is to fully meet requirements while also mitigating potential contingencies, with DSL being one of them.

@SimonCockx – thank you for your valuable input during Wednesday’s call. It significantly clarified our initial changes and allowed us to enhance our testing scenarios. I am currently investigating the build, having intentionally modified the type creator to conduct a deeper inspection of the built-in type-related features we discussed. At this stage, we do not plan to submit any contributions.

Regarding the notion of high-quality contributions – we fully align with the understanding that contributions should be comprehensive, meet all functional requirements, include thorough documentation via release notes, and expand testing scenario coverage for the proposed changes or additions. This is the approach TradeHeader intends to maintain for all future work.

I kindly ask for patience as we continue this process. Should there be any need for clarification, we will let you all know with enough time.


5 participants