Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API changes #112

Closed
wants to merge 7 commits into from
Closed

API changes #112

wants to merge 7 commits into from

Conversation

rorycl
Copy link

@rorycl rorycl commented Jan 4, 2025

Hi @mnako

I've had a chance to look at how to potentially implement new package level funcs, as set out in this PR. Let me know what you think.

Cheers,
Rory


api: split main package ParseEmail func into several variants

Split ParseEmail into three functions with the function signature

   io.Reader (Email, error)

The three functions are

  • ParseEmail (the existing main entry point)
  • ParseEmailHeaders
  • ParseEmailWithoutAttachments

ParseEmailHeadings replaces ParseHeadings that takes an Email parameter
instead of an io.Reader.

The existing ParseEmail function is renamed to parseEmail and the
processing options determined by the new processType type.

rorycl added 5 commits January 4, 2025 16:49
Split ParseEmail into three functions with the function signature

   io.Reader (Email, error)

The three functions are

   ParseEmail (the existing main entry point)
   ParseEmailHeaders
   ParseEmailWithoutAttachments

ParseEmailHeadings replaces ParseHeadings that takes an Email parameter
instead of an io.Reader.

The existing ParseEmail function is renamed to parseEmail and the
processing determined by the new processType type.
The test comparison function has also been moved from reflect.DeepEqual
to go-cmp.Equal.
@rorycl rorycl requested a review from mnako as a code owner January 4, 2025 19:27
@rorycl
Copy link
Author

rorycl commented Jan 6, 2025

I've run a quick benchmark to compare ParseEmail, ParseEmailHeaders and ParseEmailWithoutAttachments. The setup is 5 emails with ~20MB attachments each. The test is attached as performance_test.go.txt.

The results are as follows on my ancient thinkpad t470s:

rory:~/src/letters$ go test -run=Benchmark -bench=.  -benchtime=15s
goos: linux
goarch: amd64
pkg: github.com/mnako/letters
cpu: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
BenchmarkParseEmail-4                     	      14	1091424774 ns/op
BenchmarkParseEmailHeadersOnly-4          	    5770	   5307012 ns/op
BenchmarkParseEmailWithoutAttachments-4   	      20	 839565841 ns/op
PASS
ok  	github.com/mnako/letters	65.073s

I'm surprised that ParseEmailWithoutAttachments is only a bit over 20% faster than ParseEmail but I guess the programme is still having to read through the MIME parts. I realise this isn't a huge performance gain but it is useful for programmes such as my mailfinder programme.

@rorycl
Copy link
Author

rorycl commented Jan 6, 2025

Apologies, after looking at my work from the weekend I realise I hadn't put types into the structs.go file -- I'm not used to doing that.

I can provide a new PR after receiving comments, if any.

@rorycl
Copy link
Author

rorycl commented Jan 8, 2025

I've been playing with some better WithoutAttachments code. It isn't necessary to drain the io.Reader sent to decodeAttachmentFileFromBody. Simply continuing the loop in parsePart for any attached or inline file gives an order of magnitude speedup for ParseEmailHeadersOnly over ParseEmail. See this patch. The revised benchmark, (now on a plugged in laptop running a bit faster than the last reported benchmarks, but otherwise the same setup) are:

$ go test -run=Benchmark -bench=.  -benchtime=15s 
goos: linux
goarch: amd64
pkg: github.com/mnako/letters
cpu: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
BenchmarkParseEmail-4                     	      21	 825746939 ns/op
BenchmarkParseEmailHeadersOnly-4          	    8528	   2303640 ns/op
BenchmarkParseEmailWithoutAttachments-4   	     208	  85559903 ns/op
PASS
ok  	github.com/mnako/letters	64.464

Again, I'm happy to rework this patch set based on your thoughts and advice.

rorycl and others added 2 commits January 8, 2025 20:40
This simplifies the approach to skipping both inline and attached files
by continuing to the next call to multipartReader.NextPart.
@mnako
Copy link
Owner

mnako commented Jan 9, 2025

@rorycl,

Thank you for taking the time to open this PR. I had a chance to read through the code and description carefully.

I now see that this PR closely follows a discussion that we have already had in that it exports ParseEmailHeaders() and ParseEmailWithoutAttachments(). You have also moved from a string parameter "HeadersOnly" to an processType int.

My previous concerns regarding an explicit headersOnly option (whether string or int) still hold:

I still have some objections against a headersOnly option, regardles of whether we implement that as a string arg, a config struct, or—I really like that idea as referenced by @rorycl—a self-referential function options.

Generally, I think that options of the form xOnly, skipY, overrideZ point out design flaws and highlight places that are not flexible enough. I think that we should offer the option to parse headers only, but in a more coherent way, rather than framing it as an exception to the default flow.

You can find more details on this in my previous comments: #85 (comment) and #85 (comment).

I appreciate the desire to process emails without attachments. However, I am concerned that introducing functions for every possible use case (e.g., ParseEmailPlainTextBodiesOnly()) might become difficult to maintain in the long run. It could lead to an explosion of specific methods, each addressing a narrow use case.

Instead of multiple functions, I am exploring alternative solutions to provide the desired functionality and performance. These could include:

  • Functional filtering predicates: Developers could pass a function that determines which email elements need parsing and which can be ignored.
  • Lazy loading email structure fields: Headers, bodies, attachments, etc., would only be parsed when first accessed.

Both approaches have trade-offs, such as memory usage for the original reader versus performance benefits of lazy loading. I will carefully consider these options and try and work on that direction soon.

@rorycl
Copy link
Author

rorycl commented Jan 10, 2025

Hi @mnako

Thanks for considering this PR and your thoughtful comments.

I guess there are two major considerations: the API of the package, and the implementation. Excuse the long essay which follows, which is meant to be helpful!

API

Regarding the API I think it would be helpful for users of the package to skip elements of processing such as bodies or attachments. Users may also wish to specify non-strict options such as looser or custom date and address parsing.

We have discussed a single entry point at ParseEmail with func options, as suggested in #64.

A set of options that spring to mind include:

WithLooseAddressParsing
WithCustomAddressFunc
WithExtendedDateParsing
WithCustomDateFunc
WithCustomDecoderFunc 
WithOmitBody
WithOmitAttachments

Having a single entry point suggests that a "sparse" *Email might sometimes be returned with, for example, no body or attachments. I think this is reasonable, but suggests that the newly introduced ParseHeaders function should be removed from the API.

A module user using full email parsing could switch from

email, err := ParseEmail(r)

to the following if they only wanted headers:

email, err := ParseEmail(r, WithOmitBody(true))

or to the following if they wanted to omit attachments and have custom date parsing:

email, err := ParseEmail(r, WithOmitAttachments(true), WithCustomDateFunc(datefunc))

etc.

(I'd still argue that ParseEmail, ParseEmailHeaders and ParseEmailWithoutAttachments are mutually exclusive options each of which could benefit from the With type func options noted above, apart from the last two, and work "out of the box" for most library users. But I won't stretch the point further.)

I wonder also if there is scope to reduce the number of public package identifiers shown by go doc ..

Implementation

I suggest that the implementation of the mechanics of letters can differ from its API. For example a central config or option struct or enum can allow switching between modes of operation. Keeping this configuration separate from the Email struct seems a good idea both for testability and composability.

It may be worth considering forming submodules to help make letters more maintainable in future. Along the themes of the options I've noted above, I can imagine date, address and decoding submodules that would allow independent testing from the main corpus of test emails. This may also allow custom funcs to be incorporated more naturally.

Regarding processing email without including reading all the attachments into memory, I suggest that (with reference to 14), that io.Readers are returned instead.

For example, AttachedFile which presently is defined as follows:

type AttachedFile struct {
    ContentType        ContentTypeHeader
    ContentDisposition ContentDispositionHeader
    Data               []byte
}

Could be altered to:

type AttachedFile struct {
    ContentType        ContentTypeHeader
    ContentDisposition ContentDispositionHeader
    Reader             io.Reader // decoder wrapped io.Reader
    Size               int       // number of bytes
}

InlineFile would be similarly altered.

Package users could choose whether to process the file, and how.

If it is possible to wrap the part io.Reader from decodeAttachedFileFromPart(part *multipart.Part...) with the necessary decoders without reading bytes, this might be possible.


Thanks again for your excellent work on this module, and for the chance for a discussion.

@mnako
Copy link
Owner

mnako commented Jan 10, 2025

@rorycl,

thank you for the comments. Apologies, but I only have just enough time to send two comments now:

I think it would be helpful for users of the package to skip elements of processing such as bodies or attachments. Users may also wish to specify non-strict options such as looser or custom date and address parsing.

I completely agree with the need. I just think that we need to figure out how best to do it, before calling letters 1.0-mature.

A set of options that spring to mind include:

WithLooseAddressParsing
WithCustomAddressFunc
WithExtendedDateParsing
WithCustomDateFunc
WithCustomDecoderFunc
WithOmitBody
WithOmitAttachments

If I am understanding this correctly, this suggests a number of helper functions to construct functional options that sets a feature on the email parser's config.

On the first read, I do find this idea elegant and composable.

Please allow me some time to read all of your comments and think about composability (are the with() helpers chainable?) and semantics (should we prefer WithOmitAttachments() or WithAttachments()?) and I will get back with another comment.

@rorycl
Copy link
Author

rorycl commented Jan 11, 2025

Hi @mnako

Thanks for your further response.

I've made a quick sketch of how functional options might be implemented here.

If you pull the parser-options branch of my fork of your repo at github.com/rorycl/letters that should be runnable with

go test -v -run TestOptionParse .                                                          

Further note, 14 Jan 2025

A further note about the possible benefits of separating initialising the parser from running it is that the initialisation could only be needed once but used to process many emails.

An example use case is the code here.

@rorycl rorycl changed the title API changes: use the same function signature for package funcs API changes Jan 11, 2025
@mnako
Copy link
Owner

mnako commented Jan 19, 2025

@rorycl ,

I've made a quick sketch of how functional options might be implemented here.

Thank you for sharing that code. I mostly agree with the suggested approach for options.

One place that I still find not flexible enough are the wholeEmail, headersOnly, and skipAttachments options. Instead, in #119 I am suggesting a functional filter approach, where the developer using letters can decide whether to parse all:

emailParser := letters.NewEmailParser(
    WithFileFilter(AllFiles)	
)
email, err := emailParser.ParseEmail(rawEmail)

none:

noFilesEmailParser := letters.NewEmailParser(
    WithFileFilter(NoFiles)	
)
email, err := noFilesEmailParser.ParseEmail(rawEmail)

or (most interesingly) some files:

customJPGOnlyEmailParser := letters.NewEmailParser(
    WithFileFilter(
        func(cth ContentTypeHeader) bool {
            return strings.HasSuffix(strings.ToLower(cth.Params["name"]), ".jpg")
        },
    ),	
)
email, err := customJPGOnlyEmailParser.ParseEmail(rawEmail)

This avoids the code smell of xOnly and skipY that I mentioned during our previous discussions and opens letters for even more use cases: e.g. parsing only the attachments that a given program is interested in, while ignoring everything else.

I included more information in https://github.com/mnako/letters/pull/119/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R178.

Please let me know what you think and if this covers your use case.

This PR is still very far from being production ready, but I can commit to seeing it through if we agree on this direction.

@mnako
Copy link
Owner

mnako commented Jan 19, 2025

cc @ValleyCrisps for your opinion on #112 (comment) and #119

@rorycl
Copy link
Author

rorycl commented Jan 19, 2025

Hi @mnako

Your suggestions for functional options/filters sound great. I particularly like the idea of the options allowing the library user to extract (for example) jpegs.

Allowing WithFileFilter to be a user supplied func allows lots of opportunities for customisation by the user, without changing the library. That would be a very valuable improvement.

Additionally users could supply a func to decide what to do with files at the io.Reader level, as set out in #118 and my nobuf branch centering on changes to decoders.go and some sketch changes to structs.go -- which pass the test suite fine.

Therefore an opt such as WithFileFilter might be a library supplied version of a broader library capability for users deal with attachments and also never have to materialise them into []byte in the Email struct.

I hope to have time later in the week to do a refactoring along the lines I've mentioned in #118 if you have time to make some comments beforehand, that would be great. I'd like to make that work as useful as possible.

--

I suggest we close this PR and move the discussion to #118 if that works for you.

@ValleyCrisps
Copy link
Collaborator

It is very nice to see the active discussion, and the thought going into the parse exensibility! Let me weigh in with a few comments as well.

In general, I think we should prioritise extensible and maintainable solutions over hyper-customisation.
I can see the use-case for skipping the parsing of attachments, as it has the potential to save both memory and execution time. On the other hand, named parameters/options like withXParsingOption or processType are hard to maintain so I would recommend having as few as possible and trying to abstract the parsing logic instead.

I think the functional filters @mnako proposed cover the most common use-cases, and offer additional flexibility on top of it (I really like the option to parse only some file types, I would not have thougth of it myself, but I can see its utility).

For the more specialised use-cases, I wonder if it would not be better keep the email parser simple (single-responsibility), and instead address them with post-parsing logic.

@rorycl
Copy link
Author

rorycl commented Jan 26, 2025

Hi @ValleyCrisps

I haven't seen any suggestions for hyper-customisation. Certainly my own suggestions have centred (even if poorly phrased) on allowing part-processing of an email or the supply of user-defined functions.

So far these suggestions have centred on:

  • skipping processing parts of an email if the part is not of interest
  • allowing the user to provide their own date parsing func
  • allowing the user to provide their own address parsing func
  • considering how to pass a file processor func to allow attached and inline files to remain io.Readers

These suggestions all hinge on whether the parser is defined by a struct and if its main funcs hang off that struct. Having the parser defined by a struct opens up customisation possibilities.

As it happens in the 10s of thousands of emails I've used to parse with letters the majority work great! (Thanks @mnako). However there are a substantial minority of emails that don't meet the formats that letters accepts. Even net/mail covers substantially more date formats than letters, for example.

So what is a user to do when they have to parse an email but letters won't accept the date or address (for example). Get another library? Might it not be better to just let them supply their own func if that is going to let them continue work? The main power of a With option is to allow the user to pass a closure to solve this sort of problem. This would make letters both simpler in the long run and much more useful.

Allowing users to override default funcs is, in my view, the way to go. The other tweaks will likely make a big improvement to memory usage and processing speed too.

@mnako
Copy link
Owner

mnako commented Feb 2, 2025

In general, I think we should prioritise extensible and maintainable solutions over hyper-customisation. I can see the use-case for skipping the parsing of attachments, as it has the potential to save both memory and execution time. On the other hand, named parameters/options like withXParsingOption or processType are hard to maintain so I would recommend having as few as possible and trying to abstract the parsing logic instead.

I think the functional filters @mnako proposed cover the most common use-cases, and offer additional flexibility on top of it (I really like the option to parse only some file types, I would not have thougth of it myself, but I can see its utility).

Thank you for your comments, @ValleyCrisps.

Since we agree on all points, there is no need for me to come back to the processType idea. Instead, we can focus on #119.

@rorycl rorycl closed this Feb 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants