Email address profiling
Email profiling can be a useful tool for assessing customer quality in many industries. In fintech or insurance, it can supply predictive features for credit risk management or actuarial machine learning models.
Each part of an email address is profiled in a different way. The username (the part before the @) gives insight into the person's choices, transparency, use of nicknames, and randomness, while the domain (the part after the @) can provide a lot of useful information of its own.
Profiling the username comes down to creating features that can be used in scoring and in detecting gibberish email addresses. Feature creation can focus on the following:
- First name present
- Last name present
- Number present
- First name only
- Levenshtein distance from first name
- Levenshtein distance from last name
All of these are interesting indicators, but none of them can be read as good or bad on its own; they go into predictive modeling alongside other attributes.
Domain name profiling comes down to a few things:
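The username features listed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the feature keys, and the choice to strip digits and separators before comparing are all assumptions made for the example, and the letter filter is ASCII-only.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def username_features(email: str, first_name: str, last_name: str) -> dict:
    """Build username features for one customer record (hypothetical keys)."""
    username = email.rsplit("@", 1)[0].lower()
    first, last = first_name.lower(), last_name.lower()
    # Assumption: drop digits and separators so "john.smith84" compares
    # as "johnsmith". Non-ASCII letters are dropped too in this sketch.
    letters_only = re.sub(r"[^a-z]", "", username)
    return {
        "first_name_present": bool(first) and first in username,
        "last_name_present": bool(last) and last in username,
        "number_present": any(ch.isdigit() for ch in username),
        "first_name_only": bool(first) and letters_only == first,
        "levenshtein_first_name": levenshtein(letters_only, first),
        "levenshtein_last_name": levenshtein(letters_only, last),
    }
```

For example, `username_features("john.smith84@example.com", "John", "Smith")` flags both names and a number as present. The resulting dictionary is meant to be joined with other attributes before modeling rather than thresholded directly.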
- DNS information (MX records, domains)
- Disposable/Junk/temporary mail detection
- Email provider
Some of the rules for profiling an email address by its domain name are straightforward. If the email is not deliverable, there is no point in sending it. Deliverability can be checked, though never fully guaranteed, by looking at WHOIS records and MX DNS records. MX records work like a postcode: they say which mail servers accept email for the domain. If a domain has no MX records, sending servers fall back to its address (A/AAAA) records, and if those are missing too, the email cannot be delivered. No ifs or buts. WHOIS tells you whether the domain is registered and, where the data is not redacted, who owns it. If the domain does not exist, there is no point in sending an email.
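The DNS part of this check could look roughly like the sketch below. The two lookup callables are hypothetical stand-ins for a real DNS client, injected so the rule itself stays testable without network access. The sketch also covers two details of the mail standards: a sender falls back to the domain's address records when no MX record exists, and a single MX record pointing at "." (a "null MX") means the domain declares that it accepts no mail.

```python
from typing import Callable, List


def can_receive_mail(
    domain: str,
    lookup_mx: Callable[[str], List[str]],
    lookup_address: Callable[[str], List[str]],
) -> bool:
    """Return True if the domain publishes somewhere to deliver mail.

    Both lookups are hypothetical helpers: each takes a domain and returns
    a list of records, or an empty list when none exist.
    """
    mx_hosts = lookup_mx(domain)
    # Null MX: one record whose target is the root "." means "no mail here".
    if [host.rstrip(".") for host in mx_hosts] == [""]:
        return False
    if mx_hosts:
        return True
    # No MX at all: senders fall back to the domain's A/AAAA records.
    return bool(lookup_address(domain))
```

In practice the lookups might wrap a DNS library such as dnspython (`dns.resolver.resolve(domain, "MX")`), returning an empty list when the resolver reports no answer. A `True` result only says that mail has a destination, not that the specific mailbox exists.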
Next comes junk or temporary email detection. If someone gives you a temporary mailbox such as Guerrilla Mail, it is highly questionable whether they want any long-term relationship with you and your business. Such services are popular with fraudsters because they are easy to use and provide as many different addresses as needed.
Then there is email provider detection. A free mailbox such as Gmail differs from a business mailbox hosted on G Suite. You can score the identification of, for example:
- Free email provider (Gmail, Hotmail, …)
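A common way to implement this is a lookup against a list of known disposable domains. The sketch below assumes such a list is available; the three entries shown are only an illustrative sample, and a production list would be far longer and would need regular updates.

```python
# Illustrative sample only; real blocklists hold thousands of domains.
DISPOSABLE_DOMAINS = {"guerrillamail.com", "sharklasers.com", "mailinator.com"}


def is_disposable(email: str, blocklist=DISPOSABLE_DOMAINS) -> bool:
    """Return True if the email's domain, or a parent of it, is blocklisted."""
    domain = email.rsplit("@", 1)[-1].strip().lower().rstrip(".")
    labels = domain.split(".")
    # Check the full domain and each parent domain, but never the bare TLD,
    # so that "user@mx.mailinator.com" also matches "mailinator.com".
    return any(
        ".".join(labels[i:]) in blocklist for i in range(len(labels) - 1)
    )
```

The result works better as one more feature for the model than as a hard rejection rule, since blocklists are never complete.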
- Education institution
- Business Outlook
- Business Google Suite
- Generic webhosting email provider
- Self-hosted Outlook
- Other own email server
The type of service handling the email can tell you a lot. Free email costs nothing, a cloud solution for companies costs more, and running your own Outlook server requires an expensive license and IT support. As such, it can be a useful predictor for your models.
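One rough way to derive such a category is to combine the domain itself with the hostnames in its MX records. The sketch below is built on assumptions: the free-provider set, the education suffixes, the category labels, and the MX hostname suffixes for hosted business mail are all illustrative and should be verified before use. MX records alone also cannot tell a self-hosted Outlook server from any other self-managed server, so both end up in one bucket here.

```python
FREE_PROVIDERS = {"gmail.com", "hotmail.com", "outlook.com", "yahoo.com"}
EDU_SUFFIXES = (".edu", ".ac.uk")
# Assumed MX hostname suffixes for hosted business mail.
MX_SUFFIX_TO_PROVIDER = {
    ".google.com": "business_google",
    ".googlemail.com": "business_google",
    ".mail.protection.outlook.com": "business_outlook",
}


def classify_provider(domain: str, mx_hosts: list) -> str:
    """Map a domain and its MX hostnames to a coarse provider category."""
    domain = domain.lower().rstrip(".")
    if domain in FREE_PROVIDERS:
        return "free"
    if domain.endswith(EDU_SUFFIXES):
        return "education"
    hosts = [host.lower().rstrip(".") for host in mx_hosts]
    for host in hosts:
        for suffix, provider in MX_SUFFIX_TO_PROVIDER.items():
            if host.endswith(suffix):
                return provider
    # An MX host inside the customer's own domain suggests a self-managed server.
    if any(host == domain or host.endswith("." + domain) for host in hosts):
        return "own_server"
    # Anything else is treated as a generic hosting provider.
    return "other_hosted" if hosts else "unknown"
```

The returned label can be one-hot encoded and passed to the model like any other categorical attribute.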
Lastly, there is an interesting thing called a catch-all email address. Sometimes people buy a domain name and redirect all incoming email, whatever the username, to their own mailbox. This can be negative or positive. Negative if done by fraudsters who want to reuse as many domain names as possible in a short period. Positive if someone has bought their own domain name but does not know how to handle everything involved in managing their own mail system.
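One heuristic for spotting a catch-all domain is to ask its mail server whether it would accept a mailbox that almost certainly does not exist. The sketch below assumes a hypothetical `accepts_recipient` callable that performs that probe and returns True or False; it is injected so the logic can be exercised without touching a real server.

```python
import secrets
from typing import Callable


def looks_like_catch_all(
    domain: str, accepts_recipient: Callable[[str], bool]
) -> bool:
    """Probe a random, almost certainly nonexistent mailbox on the domain."""
    # 24 random hex characters make a collision with a real mailbox implausible.
    random_local = "probe-" + secrets.token_hex(12)
    return accepts_recipient(f"{random_local}@{domain}")
```

A real `accepts_recipient` might open an SMTP session with the standard library's `smtplib` and inspect the reply code from `SMTP.rcpt()`. Many servers accept every recipient during the session and bounce later, or block such probes entirely, so the result is a weak signal rather than proof.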