Phishing URL Detection with ML (2022)

Phishing is a form of fraud in which the attacker tries to obtain sensitive information such as login credentials or account information by posing as a reputable entity or person in email or other communication channels.

Typically a victim receives a message that appears to have been sent by a known contact or organization. The message contains malicious software targeting the user’s computer or has links to direct victims to malicious websites in order to trick them into divulging personal and financial information, such as passwords, account IDs or credit card details.

Phishing is popular among attackers, since it is easier to trick someone into clicking a malicious link which seems legitimate than trying to break through a computer’s defense systems. The malicious links within the body of the message are designed to make it appear that they go to the spoofed organization using that organization’s logos and other legitimate contents.

In this article I explain: phishing domain (or Fraudulent Domain) characteristics, the features that distinguish them from legitimate domains, why it is important to detect these domains, and how they can be detected using machine learning and natural language processing techniques.

Many users unwittingly click on phishing domains every hour of every day. The attackers target both users and companies. According to the 3rd Microsoft Computing Safety Index report, released in February 2014, the annual worldwide impact of phishing could be as high as $5 billion.

What is the reason for this cost?

The main reason is users’ lack of awareness. But security defenders must also take precautions to prevent users from encountering these harmful sites. Preventing these huge costs can start with raising people’s awareness, in addition to building strong security mechanisms that are able to detect phishing domains and prevent them from reaching the user.

Let’s look at the URL structure to understand clearly how attackers think when they create a phishing domain.

A Uniform Resource Locator (URL) is used to address web pages. The figure below shows the relevant parts in the structure of a typical URL.

[Figure: structure of a typical URL]

It begins with the protocol used to access the page. The fully qualified domain name identifies the server that hosts the web page. It consists of a registered domain name (second-level domain) and a suffix, which we refer to as the top-level domain (TLD). The domain name portion is constrained, since it has to be registered with a domain name registrar. A host name consists of a subdomain name and a domain name. The URL may also have path and file components. A phisher has full control over the subdomain and path portions and can set them to any value at will. We use the term FreeURL to refer to those parts of the URL in the rest of the article.
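
The split described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `split_url` and the example URL are hypothetical, and treating the last two labels of the host name as the registered domain is a simplifying assumption that breaks for multi-part suffixes such as co.uk.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Split a URL into the parts discussed above (naive two-label rule)."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    # Assumption: the registered domain is the last two labels of the host name.
    if len(labels) >= 2:
        domain, tld = labels[-2], labels[-1]
    else:
        domain, tld = labels[0], ""
    subdomain = ".".join(labels[:-2])
    return {
        "protocol": parts.scheme,
        "subdomain": subdomain,
        "domain": domain,
        "tld": tld,
        "path": parts.path,
        # FreeURL: the parts that the phisher fully controls.
        "free_url": (subdomain, parts.path),
    }

info = split_url("http://login.example.com/account/index.html")
print(info["domain"] + "." + info["tld"])  # registered domain
print(info["free_url"])                    # subdomain and path
```

A production system would more likely rely on a public suffix list to find the registered domain instead of the two-label rule used here.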

The attacker can register any domain name that has not been registered before. This part of the URL can be set only once. The phisher can change the FreeURL at any time to create a new URL. Security defenders struggle to detect phishing domains because the FreeURL part is unique to each URL. Once a domain is detected as fraudulent, it is easy to block it before a user accesses it.

Some threat intelligence companies detect fraudulent web pages or IPs and publish them as blacklists, which makes it easier for others to block these harmful assets (for example, cymon and firehol).

The attacker must choose the domain name intelligently, because the aim is to convince users, and must then set the FreeURL to make detection difficult. Let’s analyse the example given below.

[Figure: example phishing URL that begins with paypal.com but whose registered domain is active-userid.com]

Although the real domain name is active-userid.com, the attacker tried to make the URL look like paypal.com by adding FreeURL parts. When users see paypal.com at the beginning of the URL, they may trust the site, connect to it, and then share their sensitive information with this fraudulent site. This is a method frequently used by attackers.
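
To make this trick concrete, here is a small Python sketch that flags a URL when a brand name appears only in the FreeURL parts and not in the registered domain. The function name, the example URLs, and the two-label rule for the registered domain are assumptions made for illustration; the exact URL from the example above is not reproduced here.

```python
from urllib.parse import urlsplit

def brand_only_in_free_url(url, brand):
    """True if `brand` shows up in the subdomain or path but not in the registered domain."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    registered = ".".join(labels[-2:])   # naive assumption: last two labels
    subdomain = ".".join(labels[:-2])
    in_free_url = brand in subdomain or brand in parts.path.lower()
    return in_free_url and brand not in registered

# Hypothetical URL in the spirit of the example above.
print(brand_only_in_free_url("http://paypal.com.active-userid.com/login", "paypal"))
# The genuine site has the brand in its registered domain.
print(brand_only_in_free_url("https://www.paypal.com/signin", "paypal"))
```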

Other methods that are often used by attackers are Cybersquatting and Typosquatting.

Cybersquatting (also known as domain squatting) is registering, trafficking in, or using a domain name with bad-faith intent to profit from the goodwill of a trademark belonging to someone else. The cybersquatter may offer to sell the domain at an inflated price to the person or company that owns a trademark contained within the name, or may use it for fraudulent purposes such as phishing. For example, suppose the name of your company is “abcompany” and you register abcompany.com. Phishers can then register abcompany.net, abcompany.org, and abcompany.biz and use them for fraudulent purposes.

Typosquatting, also called URL hijacking, is a form of cybersquatting that relies on mistakes such as typographical errors made by Internet users when typing a website address into a web browser, or on typographical errors that are hard to notice during quick reading. URLs created with typosquatting look like a trusted domain. A user may accidentally enter an incorrect website address or click a link that looks like a trusted domain, and in this way may visit an alternative website owned by a phisher.

A famous example of typosquatting is goggle.com, an extremely dangerous website. A similar example is yutube.com, which targets YouTube users in the same way. Similarly, www.airfrance.com has been typosquatted as www.arifrance.com, diverting users to a website peddling discount travel. Some other examples are paywpal.com, microroft.com, applle.com, and appie.com.
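
One common way to catch such look-alike names is to measure the edit distance between a candidate domain and a list of trusted domains. The sketch below uses the Levenshtein distance; the function names, the trusted list, and the threshold of 2 are assumptions chosen for illustration, not a method prescribed by any particular study.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

def looks_typosquatted(domain, trusted, max_distance=2):
    """True if the domain is close to, but not equal to, a trusted domain."""
    return any(0 < edit_distance(domain, t) <= max_distance for t in trusted)

TRUSTED = ["google.com", "youtube.com", "airfrance.com", "paypal.com"]
for candidate in ["goggle.com", "yutube.com", "arifrance.com", "paywpal.com"]:
    print(candidate, looks_typosquatted(candidate, TRUSTED))
```

A swapped pair of letters, as in arifrance.com, counts as two edits under this distance, which is why the threshold is set to 2 rather than 1.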

There are many algorithms and a wide variety of data types for phishing detection in the academic literature and in commercial products. A phishing URL and the corresponding page have several features that differentiate them from a legitimate URL. For example, an attacker can register a long and confusing domain to hide the actual domain name (cybersquatting, typosquatting). In some cases attackers use IP addresses directly instead of a domain name. This type of event is out of our scope, but it can be used for the same purpose. Attackers can also use short domain names that are unrelated to legitimate brand names and have no FreeURL addition. These types of websites are also out of our scope, because they are more relevant to fraudulent domains than to phishing domains.

Besides URL-Based Features, academic studies use several other kinds of features in machine learning algorithms for detection. The features collected from academic studies on phishing domain detection with machine learning techniques are grouped as given below.

  1. URL-Based Features
  2. Domain-Based Features
  3. Page-Based Features
  4. Content-Based Features

URL-Based Features

The URL is the first thing to analyse when deciding whether a website is a phishing site or not. As we mentioned before, URLs of phishing domains have some distinctive points. Features related to these points are obtained when the URL is processed. Some URL-Based Features are given below.

  • Digit count in the URL
  • Total length of URL
  • Checking whether the URL is Typosquatted or not. (google.com → goggle.com)
  • Checking whether it includes a legitimate brand name or not (apple-icloud-login.com)
  • Number of subdomains in URL
  • Is the top-level domain (TLD) one of the commonly used ones?
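
As a rough sketch, several of the URL-Based Features listed above can be computed with the Python standard library alone. The brand list, the TLD list, and the function name `url_features` are illustrative assumptions; the typosquatting check is left out here because it needs a similarity measure of its own.

```python
from urllib.parse import urlsplit

COMMON_TLDS = {"com", "org", "net", "edu", "gov"}  # illustrative list (assumption)
BRANDS = {"apple", "icloud", "paypal", "google"}   # illustrative list (assumption)

def url_features(url):
    """Compute a few URL-based features for one URL."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    registered = labels[-2] if len(labels) >= 2 else host
    return {
        "digit_count": sum(ch.isdigit() for ch in url),
        "url_length": len(url),
        # Labels in front of the registered domain and TLD (naive two-label rule).
        "subdomain_count": max(len(labels) - 2, 0),
        # A brand appears in the host, but the registered name is not itself a brand.
        "has_brand_name": registered not in BRANDS and any(b in host for b in BRANDS),
        "common_tld": labels[-1] in COMMON_TLDS,
    }

print(url_features("http://apple-icloud-login.com/verify?id=123"))
print(url_features("https://www.apple.com/"))
```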

Domain-Based Features

The purpose of phishing domain detection is to detect phishing domain names. Therefore, passive queries related to the domain name that we want to classify as phishing or not provide useful information. Some useful Domain-Based Features are given below.

  • Is its domain name or its IP address in the blacklists of well-known reputation services?
  • How many days have passed since the domain was registered?
  • Is the registrant name hidden?
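
The features above can be represented as follows once the passive lookups have been done. This sketch assumes that the registration date, the registrant name, and the blacklist have already been fetched from some WHOIS and reputation source; it makes no network queries, and all names and values in it are hypothetical.

```python
from datetime import date

def domain_features(domain, registered_on, registrant_name, blacklist, today):
    """Turn already-collected domain data into features."""
    return {
        "in_blacklist": domain in blacklist,
        "age_days": (today - registered_on).days,
        # Assumption: a missing registrant name means it is hidden.
        "registrant_hidden": registrant_name is None,
    }

blacklist = {"active-userid.com"}  # hypothetical blacklist content
features = domain_features(
    "active-userid.com",
    registered_on=date(2022, 1, 1),
    registrant_name=None,
    blacklist=blacklist,
    today=date(2022, 1, 31),
)
print(features)
```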

Page-Based Features

Page-Based Features use information about pages that is calculated by reputation ranking services. Some of these features indicate how reliable a website is. Some Page-Based Features are given below.

  • Global Pagerank
  • Country Pagerank
  • Position in the Alexa Top 1 Million Sites

Some Page-Based Features give us information about user activity on the target site. Some of these features are given below. Obtaining these types of features is not easy; some paid services provide them.

  • Estimated Number of Visits for the domain on a daily, weekly, or monthly basis
  • Average Pageviews per visit
  • Average Visit Duration
  • Web traffic share per country
  • Count of references from social networks to the given domain
  • Category of the domain
  • Similar websites etc.

Content-Based Features

Obtaining these types of features requires an active scan of the target domain. Page contents are processed to detect whether the target domain is used for phishing or not. Some of the processed information about pages is given below.

  • Page Titles
  • Meta Tags
  • Hidden Text
  • Text in the Body
  • Images etc.

By analysing this information, we can gather information such as:

  • Whether a login is required to use the website
  • Website category
  • Information about audience profile etc.

All of the features explained above are useful for phishing domain detection. In some cases, however, some of them may not be practical, so there are limitations on using these features. For example, it may not be sensible to use Content-Based Features when developing a fast detection mechanism that must analyse between 100,000 and 200,000 domains. As another example, Page-Based Features are not very useful if we want to analyse newly registered domains. Therefore, the features used by the detection mechanism depend on its purpose and should be selected carefully.

Detecting phishing domains is a classification problem, which means we need labeled data containing samples of both phishing domains and legitimate domains in the training phase. The dataset used in the training phase is very important for building a successful detection mechanism. We have to use samples whose classes are precisely known: samples labeled as phishing must be known with certainty to be phishing, and samples labeled as legitimate must be known with certainty to be legitimate. If we use samples that we are not sure about, the system will not work correctly.

For this purpose, some public phishing datasets have been created. One of the well-known ones is PhishTank. These data sources are commonly used in academic studies.

Collecting legitimate domains is another problem. For this purpose, site reputation services are commonly used. These services analyse and rank available websites. The ranking may be global or country-based, and the ranking mechanism depends on a wide variety of features. Websites with high rank scores are legitimate sites that are used very frequently. One of the well-known reputation ranking services is Alexa. Researchers use Alexa’s top lists as a source of legitimate sites.

When we have raw data for phishing and legitimate sites, the next step is to process this data and extract meaningful information from it to detect fraudulent domains. The dataset used for machine learning must consist of these features. So, we must process the raw data collected from Alexa, PhishTank, or other data sources and create a new dataset to train our system with machine learning algorithms. The features should be selected according to our needs and purposes, and their values should be calculated for every sample.
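
A minimal sketch of this step, assuming the raw phishing and legitimate URLs have already been downloaded into two Python lists. The function names and the two toy features are hypothetical; a real feature set would follow the groups described earlier.

```python
def extract_features(url):
    """Toy feature vector: URL length and digit count (illustrative only)."""
    return [len(url), sum(ch.isdigit() for ch in url)]

def build_dataset(phishing_urls, legitimate_urls):
    """Return (X, y), where label 1 means phishing and 0 means legitimate."""
    X, y = [], []
    for label, urls in ((1, phishing_urls), (0, legitimate_urls)):
        for url in urls:
            X.append(extract_features(url))
            y.append(label)
    return X, y

X, y = build_dataset(
    ["http://paypal.com.active-userid.com/login"],  # e.g. from a phishing feed
    ["https://www.example.com/"],                    # e.g. from a top-sites list
)
print(X, y)
```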

There are many machine learning algorithms, and each algorithm has its own working mechanism. In this article, I explain the Decision Tree algorithm, because I think it is a simple and powerful one.

As we mentioned above, phishing domain detection is a classification problem, so we need labeled instances to build the detection mechanism. In this problem we have two classes: (1) phishing and (2) legitimate.

When we calculate the features that we have selected according to our needs and purposes, our dataset looks like the one in the figure below. In our example, we selected 12 features and calculated them. Thus we generated a dataset that will be used in the training phase of the machine learning algorithm.

A Decision Tree can be considered an improved nested if-else structure. The features are checked one by one. An example tree model is given below.

Generating the tree is the main part of the detection mechanism. The yellow, elliptical shapes represent features and are called nodes. The green, angular shapes represent classes and are called leaves. When a sample arrives, the length is checked first, and then the other features are checked according to the result. When the sample’s journey through the tree is complete, the class that the sample belongs to becomes clear.
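
The nested if-else view can be written out directly. The sketch below is a hand-written toy tree, not the model from this article: the features it checks after the length, the thresholds, and the function name are all assumptions made for illustration.

```python
def classify(sample):
    """Walk a small hand-written decision tree and return the class of the sample."""
    # Root node: the length feature is checked first (threshold is an assumption).
    if sample["url_length"] > 54:
        # Inner node: does the URL contain a brand name?
        if sample["has_brand_name"]:
            return "phishing"        # leaf
        # Inner node: domain age in days.
        if sample["domain_age_days"] > 365:
            return "legitimate"      # leaf
        return "phishing"            # leaf
    # Inner node: number of subdomains.
    if sample["subdomain_count"] > 2:
        return "phishing"            # leaf
    return "legitimate"              # leaf

sample = {"url_length": 72, "has_brand_name": True,
          "domain_age_days": 3, "subdomain_count": 1}
print(classify(sample))
```

In a trained tree, the order of the checks and the thresholds are not written by hand but learned from the labeled dataset.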

[Figure: example decision tree model]

Now, the most important question about Decision Trees has not been answered yet: which feature will be placed at the root, and which ones must come after the root? Choosing features intelligently directly affects the efficiency and success rate of the algorithm.

So, how does the Decision Tree algorithm select features?

The Decision Tree uses a measure that indicates how well a given feature separates the training examples according to their target classification. The name of this measure is Information Gain. The mathematical equation of Information Gain is given below.

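In its standard form, the Information Gain of a feature A on a set of examples S is

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where S_v is the subset of S for which feature A has the value v.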

A high gain score means that the feature has a high distinguishing ability. Because of this, the feature with the maximum gain score is selected as the root. Entropy is a statistical measure from information theory that characterizes the (im)purity of an arbitrary collection S of examples. The mathematical equation of entropy is given below.

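In its standard form, the entropy of a set S is

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where c is the number of classes (two in our problem) and p_i is the proportion of examples in S that belong to class i.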

The original entropy (the entropy of the whole set before the split) is a constant value, while the relative entropy (the weighted entropy of the subsets after the split) changes with the chosen feature. A low relative entropy score means high purity; likewise, a high relative entropy score means low purity. As we move down the tree, we want to increase the purity, because high purity at the leaves implies a high success rate.

In the training phase, the dataset is divided into two parts by comparing feature values. In our example we have 14 samples. The “+” sign represents the phishing class and the “-” sign represents the legitimate class. We divide these samples into two parts according to the length feature: seven of them go to the right and the other seven go to the left. As shown in the figure below, the right part of the tree has high purity, which means a low entropy score (E); likewise, the left part has low purity and a high entropy score (E). All calculations were done according to the equations given above. The Information Gain score for the length feature is 0.151.
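
The calculation can be reproduced with a few lines of Python. The per-class counts are not stated in the text, so the counts below (9 and 5 in the full set, 3 and 4 on the left, 6 and 1 on the right) are an assumption chosen to be consistent with the 7/7 split and the reported score; the function names are hypothetical as well.

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a two-class set given the number of samples in each class."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # treat 0 * log2(0) as 0
            p = count / total
            result -= p * log2(p)
    return result

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum((p + n) / total * entropy(p, n) for p, n in children)
    return entropy(*parent) - weighted

parent = (9, 5)   # assumed class counts for all 14 samples
left = (3, 4)     # assumed counts: low purity, high entropy
right = (6, 1)    # assumed counts: high purity, low entropy
gain = information_gain(parent, [left, right])
print(round(entropy(*left), 3), round(entropy(*right), 3), round(gain, 3))
```

With these counts the gain comes out very close to the 0.151 reported above; the small difference in the last digit is what you get when intermediate entropies are rounded before subtracting.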

The Decision Tree algorithm calculates this information for every feature and selects the feature with the maximum gain score. To grow the tree, each leaf is replaced by a node that represents a feature. As the tree grows downwards, the leaves reach higher purity. When the tree is big enough, the training process is complete.
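
The growth procedure can be sketched as a small recursive function in the style of the ID3 algorithm. This is a simplified illustration under stated assumptions: the features are boolean, the toy dataset and all names are hypothetical, and growth stops when a node is pure or no features remain, which is one simple way to decide that the tree is big enough.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def gain(rows, labels, feature):
    """Information gain of splitting the samples on one feature."""
    remainder = 0.0
    for value in set(r[feature] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[feature] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, features):
    # Leaf: the node is pure or no feature is left, so return the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: gain(rows, labels, f))
    node = {"feature": best, "children": {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [f for f in features if f != best],
        )
    return node

def predict(tree, row):
    """Follow the nodes until a leaf (a class name) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["feature"]]]
    return tree

# Toy dataset (hypothetical): two boolean features per sample.
rows = [
    {"has_brand": True, "many_digits": True},
    {"has_brand": True, "many_digits": False},
    {"has_brand": True, "many_digits": False},
    {"has_brand": False, "many_digits": True},
    {"has_brand": False, "many_digits": False},
    {"has_brand": False, "many_digits": False},
    {"has_brand": False, "many_digits": False},
]
labels = ["phishing"] * 4 + ["legitimate"] * 3
tree = build_tree(rows, labels, ["has_brand", "many_digits"])
print(tree)
```

On this toy data the has_brand feature has the larger gain, so it becomes the root, and many_digits is used one level below it.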

The tree created by selecting the most distinguishing features is the model structure for our detection mechanism. Building a mechanism with a high success rate depends on the training dataset. For the system to generalize, the training set must consist of a wide variety of samples taken from a wide variety of data sources. Otherwise, our system may work with a high success rate on our own dataset but fail on real-world data.

Article information

Author: Fredrick Kertzmann

Last Updated: 11/14/2022
