Proxy Load Testing

I had a customer who was having problems with their Blue Coat appliance. They complained that the box slowed down as their traffic increased. My first thought was that this was possible if they were putting too much traffic through the Blue Coat. But after talking to the customer, I learned that they were not pushing enough traffic through the appliance to explain the slow performance. So I went to the customer with the following plan.

  • Make sure the appliance is running the latest software.
  • Disable any authentication.
  • Make sure all the features are licensed properly.
  • Generate a high volume of traffic to cause the box to slow down.

With that plan, I felt pretty good about what I was going to do, except for the last item. How was I going to put a load on the Blue Coat appliance? A tool that could generate the traffic for me would be useful. Since I have been dabbling in Python, I looked around to see if there was something written in Python to help me with this. After a little bit of searching, I found something to start with. I found the following site. This site gave me a class that reads a web page and gets the links on the page. I used that as the base for my script.

Initially, I built on that script to read all the links on a page and then fetch every link that was pulled from the page. I then thought that I could "crawl": read the links from a site, then the links from those links, and so on. Sometimes the links on a page are interesting, such as when a link leads to a large file download, but in most cases the links are boring. So I modified the "MyParser" class to capture not only the links but also the images on the page. Once I had the links for the images, I would then download the images. Downloading the images made the fetch much more interesting.
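The idea of a parser that collects both links and images can be sketched roughly as follows. This is not the "MyParser" class from the attached script (which is not shown here); the class name LinkImageParser and the helper extract are hypothetical, and the sketch assumes Python 3's html.parser module rather than whatever the original class was built on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkImageParser(HTMLParser):
    """Collect absolute URLs of anchors and images found in an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []    # targets of <a href="...">
        self.images = []   # sources of <img src="...">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def extract(html, base_url):
    """Return (links, images) found in the given HTML text."""
    parser = LinkImageParser(base_url)
    parser.feed(html)
    return parser.links, parser.images
```

The links list feeds the next level of the crawl, while the images list is simply downloaded to add volume to each page fetch.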

Once I had a script that I felt was good at reading a website and crawling its pages, I thought that the next step was to run the script multiple times to create a load. But thinking about it further, it would be difficult to launch all these scripts at the same time. Digging into Python, I started to look at threading: launching multiple threads, with each thread reading and crawling a URL.

The threading worked very well, but again, depending upon the website, one thread might complete quickly while others could run for a long time. I dug into Python some more to find a way to communicate within and between the threads. I then found queues. I set up a queue, put the initial URL list in it, and had each thread grab a URL from the queue to process. If I was doing any crawling, then when a thread found a new URL, it would put the new URL in the queue. With the queue, all the threads would end around the same time.
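A minimal sketch of the queue-plus-threads arrangement described above might look like this. It assumes Python 3's queue and threading modules; the function name crawl, the caller-supplied fetch function, and the sentinel-based shutdown are assumptions of this sketch, not details taken from the attached script.

```python
import queue
import threading


def crawl(start_urls, fetch, num_threads=4, max_depth=1):
    """Run fetch(url) over a shared work queue using a pool of threads.

    fetch is a caller-supplied (hypothetical) function that downloads one
    URL and returns the list of links it found on that page.
    """
    work = queue.Queue()
    seen = set(start_urls)
    seen_lock = threading.Lock()

    for url in start_urls:
        work.put((url, 0))

    def worker():
        while True:
            item = work.get()
            if item is None:           # sentinel: no more work
                work.task_done()
                return
            url, depth = item
            try:
                found = fetch(url)
                if depth < max_depth:
                    for link in found:
                        with seen_lock:
                            if link in seen:
                                continue
                            seen.add(link)
                        work.put((link, depth + 1))
            except Exception:
                pass                   # a failed fetch should not kill the worker
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    work.join()                        # wait until every queued URL is processed
    for _ in threads:
        work.put(None)                 # tell each worker to exit
    for t in threads:
        t.join()
    return seen
```

Because every thread pulls from the same queue, a thread that finishes a small page immediately picks up the next URL instead of sitting idle, which is why the threads tend to finish at about the same time.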

While I was at it, I thought it would be interesting to know how long it took to retrieve the data, so I decided to keep track of how much time was spent on each fetch and output the total per page and overall. After all the threads stop, I output the total amount of data that was retrieved and how long it took to retrieve it.
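One way to keep such totals across threads is a small accumulator guarded by a lock. This is only a sketch: the class name FetchStats and its methods are hypothetical, and fetch stands in for whatever function actually downloads a URL and returns its bytes.

```python
import threading
import time


class FetchStats:
    """Thread-safe running totals of bytes fetched and time spent fetching."""

    def __init__(self):
        self._lock = threading.Lock()
        self.total_bytes = 0
        self.total_seconds = 0.0
        self.fetches = 0

    def timed(self, fetch, url):
        """Call fetch(url), which must return bytes, and record size and duration."""
        start = time.perf_counter()
        data = fetch(url)
        elapsed = time.perf_counter() - start
        with self._lock:
            self.total_bytes += len(data)
            self.total_seconds += elapsed
            self.fetches += 1
        return data, elapsed

    def summary(self):
        """Return a one-line report of the totals collected so far."""
        with self._lock:
            return "%d fetches, %d bytes, %.2f s spent fetching" % (
                self.fetches, self.total_bytes, self.total_seconds)
```

Note that total_seconds here is the sum of the time spent in each fetch across all threads, so with many threads it can exceed the wall-clock time of the whole run.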

One problem that I ran into was trying to stop everything whenever I wanted to, for example with a keyboard interrupt. I could interrupt the main process, but the threads would run until they were done. The first thing I had to do was catch the "KeyboardInterrupt". In Python, catching the "KeyboardInterrupt" was easy, but getting everything to stop was another issue. The main problem is that there is no way to kill a thread, even if you know which thread you started. Initially, I set a flag and waited for everything to finish, but I found that if you exit your script with an error code, like sys.exit(-5), everything stops, even the threads. Finally, I had a way to stop the script when I wanted it to stop.
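The flag-based approach mentioned above can be sketched with threading.Event. This illustrates the "set a flag and wait" variant, not the sys.exit route the script ended up using. The function names are hypothetical, the worker body is a placeholder for a real fetch, and marking the threads as daemon threads is an assumption of this sketch so that a thread stuck in a long download cannot keep the interpreter alive at exit.

```python
import threading
import time


def worker(stop_event, results):
    """Do small units of work until asked to stop."""
    while not stop_event.is_set():
        results.append(1)      # placeholder for fetching one URL
        time.sleep(0.01)


def run(num_threads=4, duration=None):
    """Start workers; stop them on Ctrl-C or after `duration` seconds."""
    stop_event = threading.Event()
    results = []
    threads = [
        threading.Thread(target=worker, args=(stop_event, results), daemon=True)
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    try:
        if duration is None:
            while True:        # run until interrupted from the keyboard
                time.sleep(0.1)
        else:
            time.sleep(duration)
    except KeyboardInterrupt:
        print("Interrupted, stopping workers...")
    finally:
        stop_event.set()       # the flag every worker checks between units
        for t in threads:
            t.join(timeout=2)
    still_alive = sum(t.is_alive() for t in threads)
    return len(results), still_alive
```

Calling run() with no duration keeps the workers going until Ctrl-C is pressed. The flag is only checked between units of work, so a worker in the middle of a large download still has to finish that download before it notices the request to stop.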

Finally, since I was working with proxies, I converted the fetching from urllib to urllib2, because urllib2 supports opening connections through an explicit proxy. The next step would be to allow the program to authenticate when connecting to the proxy.
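Fetching through an explicit proxy can be sketched as below. The script itself used urllib2. This sketch assumes Python 3, where the same ProxyHandler and build_opener calls live in urllib.request. The function names and the proxy address in the usage note are placeholders.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(opener, url, timeout=10):
    """Download url through the opener and return the response body as bytes."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

For example, make_proxy_opener("http://proxy.example.com:8080") returns an opener whose open method routes every request through that (placeholder) proxy address, which is what forces the test traffic through the appliance rather than straight out to the Internet.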

So, how well did it work? Pretty well. I ran the script twice, with twenty threads each and crawling 1 level, and I was able to sustain a load of between 15 and 20 Meg. Not bad. I found that it really depends on the web site list given to the script and on how long you keep repeating the script. For example, crawling YouTube uses a lot of bandwidth, but if you keep crawling the site, the amount of data gets lower and lower - I think YouTube notices I am crawling it. Crawling TV sites, like ABC, NBC, and CNN, seems to do a good job, because as you crawl you end up downloading some of their show previews, which take up bandwidth.

Attachment: html-loading.zip (3.06 KB)

Comments

Proxy Load Testing

nice one
