Since our program makes use of the public infrastructure that is the Internet, it makes sense to play fair and make it behave properly, so that we avoid clashes with webmasters or, even worse, retaliation.

Webmasters may not want their site to be spidered by other computers, and they have a standard way of saying so clearly: the robots.txt file. A robots.txt file is a simple text file containing one or more rules in which a webmaster can state whether scraping is allowed at all, restrict it to certain files and folders, or allow it only for specific spiders.
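For example, a robots.txt file along these lines (a made-up illustration, not copied from any real site) blocks a spider named BadBot from the whole site, while keeping every other spider out of the /private/ folder only:

# Hypothetical robots.txt example
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /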

Our plan for playing fair is to read the site’s robots.txt file and stay well clear of the actual content if we are not allowed to read it. The wisest thing to do here is to stand on the shoulders of giants: other people have already solved this problem, and luckily some of them have shared their work for everyone else to use. This means we do not have to reinvent the wheel; we only need to write a few lines of code to read the robots.txt file and make our program compliant.

We are going to use the NuGet package manager inside Visual Studio to add and reuse the existing RobotsTxt library by Çağdaş Tekin. First open the solution and, from the Project menu, choose ‘Manage NuGet Packages’. A new window will open. Select the Online option on the left, type RobotsTxt in the ‘Search Online’ box, and press Enter. A list of the relevant packages will be shown.

Install the one that says ‘RobotsTxt: A robots.txt parser for .Net’. When prompted to select the project to install it into, just click OK, since at the moment we only have one project. The package will be installed, and a green circle with a check mark will confirm this. We can then close the package manager window and get back to the task at hand.
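If you prefer typing to clicking, the same package can also be installed from the NuGet Package Manager Console; assuming the package id is simply RobotsTxt, the command would be:

Install-Package RobotsTxt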

We will now add a new function called CheckRobots that uses RobotsTxt to determine whether we are allowed to spider the page the user requested. Add the following code at the bottom of the Form1 class in Form1.cs:

private bool CheckRobots(string url)
{
    // Build the address of the site's robots.txt file from the base url.
    var robotsFileLocation = new Uri(url).GetLeftPart(UriPartial.Authority) + "/robots.txt";
    // Download the file and let the RobotsTxt library parse it.
    var robotsFileContent = client.DownloadString(robotsFileLocation);
    Robots robots = Robots.Load(robotsFileContent);
    // Ask whether our spider may fetch the requested url.
    return robots.IsPathAllowed("keywordChecker", url);
}

This function takes a URL as a string parameter and returns a Boolean value indicating whether a robot is allowed to access that URL. On the function’s first line we take the base URL (the scheme and host, obtained with GetLeftPart), append the robots.txt filename to it, and store the result in the robotsFileLocation variable. This is the address from which we download the actual robots.txt file, storing its contents in robotsFileContent. We then ask the RobotsTxt library to parse the file and tell us whether we are allowed to read the page. This is done in two steps: first we parse the file with Robots.Load, and then we return the result of robots.IsPathAllowed.
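As a quick illustration of that first line (using a made-up address, not one from this tutorial), GetLeftPart(UriPartial.Authority) strips everything after the host name, so appending /robots.txt always points at the root of the site:

// Hypothetical example url, for illustration only.
var pageUrl = "http://www.example.com/news/article.html";
var robotsFileLocation = new Uri(pageUrl).GetLeftPart(UriPartial.Authority) + "/robots.txt";
// robotsFileLocation now holds "http://www.example.com/robots.txt"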

This should work on sites that explicitly state their conditions in a robots.txt file. Of course, on the web we can encounter plenty of other situations that cause an exception when reading the file. If the robots.txt file does not exist or is unreadable, we will assume that we are allowed to read anything on that website. To handle this, we embed our code inside a try catch statement, like so:

var robotsFileLocation = new Uri(url).GetLeftPart(UriPartial.Authority) + "/robots.txt";
try
{
    var robotsFileContent = client.DownloadString(robotsFileLocation);
    Robots robots = Robots.Load(robotsFileContent);
    return robots.IsPathAllowed("keywordChecker", url);
}
catch
{
    // If the robots.txt file is missing or unreadable, assume we are allowed in.
    return true;
}

An exception is raised when something goes wrong at run time and the program cannot continue along its normal flow. In our case, for example, we try to read the robots.txt file but fail for some reason, and the code throws an exception. We have not used any exception handlers so far, but it is wise to use them. A good programmer always tries to anticipate the unexpected situations that can arise while the end user is working with the program, and adds code so that exceptions do not blow up in the user’s face.

If we expect that some part of our code can throw an exception, we wrap that code in a try block. The code we want to run when an exception is thrown goes inside a catch block. In our case, if we cannot read the file, an exception is thrown, and when we catch it the function returns true to signify that the URL is allowed. Another structure that we are not using here is the finally block. Code in a finally block runs whether an exception occurred or not, and it is normally used for clean-up, such as closing open streams or disposing of objects that are no longer needed. You do not need it right now, so let’s move on; a small sketch of the full pattern is shown below for reference.
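Here is a minimal sketch of the try/catch/finally pattern, assuming a using System.IO directive and a hypothetical settings.txt file that may or may not exist:

StreamReader reader = null;
try
{
    // This line throws if the file is missing or cannot be opened.
    reader = new StreamReader("settings.txt");
    Console.WriteLine(reader.ReadToEnd());
}
catch (IOException ex)
{
    // Runs only when an exception of a matching type was thrown.
    Console.WriteLine("Could not read the file: " + ex.Message);
}
finally
{
    // Runs in both cases; a good place to release the stream.
    if (reader != null)
        reader.Dispose();
}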

The last thing we need to do is call the new CheckRobots function from our button click event, wrapping the code from the previous iteration inside an if statement so that it runs only when we are allowed, and otherwise displaying a message telling the user that the site may not be spidered.

private void btnCheck_Click(object sender, EventArgs e)
{
    client = new WebClient();

    // Fall back to the default url and keywords if the user left the boxes empty.
    var url = txtUrl.Text;
    url = !string.IsNullOrEmpty(url) && Uri.IsWellFormedUriString(url, UriKind.Absolute)
        ? url : "http://www.gametrailers.com";
    var keywords = txtKeywords.Text;
    keywords = !string.IsNullOrEmpty(keywords) ? keywords : "final fantasy";

    if (CheckRobots(url))
    {
        var pageContent = client.DownloadString(url);
        var keywordLocation = pageContent.IndexOf(keywords, StringComparison.InvariantCultureIgnoreCase);
        StringBuilder sb = new StringBuilder();
        if (keywordLocation >= 0)
        {
            // Find the id attribute closest to the keyword and jump to it in the preview browser.
            var pageIds = Regex.Matches(pageContent, @"id=""\s*?\S*?""");
            string matchedId = closestId(keywordLocation, pageIds);
            string idTag = matchedId.Substring(4, matchedId.Length - 5);
            brwPreview.Navigate(url + "#" + idTag);
            sb.AppendFormat("{0} are talking about {1} today.", url, keywords);
            sb.Append("\n\nSnippet:\n" + pageContent.Substring(keywordLocation, 100));
            sb.AppendFormat("\n\nClosest id: {0}", idTag);
        }
        else
        {
            sb.Append("Keyword not found!");
        }
        lblResult.Text = sb.ToString();
    }
    else
    {
        lblResult.Text = "Blocked by robots.txt!";
    }
}

We can see that the program is becoming more modular. This means that it is split into separate modules, each with a specific function, which can be reused in this and future programs. You may have noticed that we are already reaping the benefits of modularity by reusing code someone else developed for a function we needed, but we will expand on modularity in the next installment.

Once again, the full source code for this tutorial is available on GitHub.
