About CodeCrawler

Note: This web site is not related to OSTnet AMS CodeCrawler.

CodeCrawler is a smart, web-based search engine built specifically for developers who need to search source code. It combines ease of use, superb performance, and intelligent search capabilities to increase developer productivity and reduce the time needed to learn a code base.

How It Works

1. An administrator installs CodeCrawler and configures the source code repositories to be searched.
2. CodeCrawler builds a search index for the source code while analyzing each file and extracting important semantic information.
3. Developers search the indexed repositories and examine the code from within a web browser.

Rationale

Large-scale software development and maintenance present many challenges to developers. As the code base grows, so do the difficulties of staying up to date with the code, finding the right place to implement new requirements, and fixing existing bugs. In addition, as code size increases, documentation sometimes does not get updated or is missing altogether, which makes training new developers difficult and time-consuming.

When wandering through the code, developers often use grep utilities to find a particular piece of code. These utilities let developers search source files for matches to a regular expression, and while powerful, they have three disadvantages. First, writing a regular expression for a search requires at least some knowledge of what you are searching for: for example, the prefix of a variable name, or the fact that the variable name ends with a number. Second, the results returned by grep are all the matches to the given regular expression, and in that respect they are all equally relevant. This means a developer might get hundreds of results and have no idea which one to look at. Third, grep utilities are part of the operating system or are integrated within an IDE, so the results cannot be viewed from the web. These days, when so many open source projects are developed over the Internet, searching source code directly from a web browser seems natural.

With the advancement of the Internet, web search engines have become irreplaceable tools for developers. Web search engines are not as precise and powerful as grep utilities, but they address all three of grep's disadvantages: they allow inexact matches, rank the results by relevance, and display them in a web-viewable form. However, these search engines are usually geared toward text searches and do not work well for source code.

In programming languages, identifiers usually combine several words (e.g. ListArray or basic_string), but to a text search engine a word is just a word. If search engines could split an identifier into its component words, they would be more useful to developers. Search engines also compute a relevance score for a result based, in part, on how many times the search keywords occur in it. In source code, however, there is an implied understanding that some occurrences are more important than others. For example, when searching for "Foo", a document with several occurrences of a local variable "Foo" is usually less relevant than a document declaring a class "Foo". Text search engines, of course, do not have the knowledge to perform this analysis. This lack of knowledge also means that developers searching for a keyword with particular semantics, say function "Foo", might get results for class "Foo" or variable "Foo" or any other "Foo". A nice addition to the search engine query syntax would be a way to specify the semantics of the keyword being searched, for example, "function: Foo".
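To make the identifier-splitting idea concrete, here is a minimal sketch in Python. It is not CodeCrawler's actual implementation; the function name split_identifier and the splitting rules (underscores, camel-case boundaries, acronym runs, digit runs) are assumptions chosen for illustration.

```python
import re


def split_identifier(identifier):
    """Split an identifier such as ListArray or basic_string into lowercase words.

    Hypothetical helper: the rules below are one plausible choice, not a
    description of how any particular search engine tokenizes code.
    """
    words = []
    # First cut on underscores and any other non-alphanumeric characters.
    for chunk in re.split(r"[_\W]+", identifier):
        # Then cut each chunk into an acronym run (HTTP), a capitalized or
        # lowercase word (List, array), or a run of digits (2).
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [word.lower() for word in words]


if __name__ == "__main__":
    for name in ("ListArray", "basic_string", "parseHTTPResponse"):
        print(name, "->", split_identifier(name))
```

With a splitter like this, an indexer could store both the full identifier and its component words, so that a keyword search for "array" would also find ListArray.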

CodeCrawler will combine the best of web search engines and grep tools, and extend them with knowledge about programming language syntax and source code semantics to allow more intelligent searches that more accurately determine the relevance of search results. CodeCrawler will provide a web interface that allows users to submit search queries using regular expressions (like grep), using keywords (like web searches), or using special programming specific extensions.
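The three kinds of query described above could be told apart along the following lines. This is only a sketch under stated assumptions: the /pattern/ convention for regular expressions, the set of qualifier names in KINDS, and the Query and parse_query names are all hypothetical and are not taken from CodeCrawler itself.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of semantic qualifiers a user may put before a keyword.
KINDS = {"class", "function", "variable"}


@dataclass
class Query:
    mode: str                   # "regex", "semantic", or "keyword"
    text: str                   # the pattern or keyword(s) to look for
    kind: Optional[str] = None  # set only for semantic queries


def parse_query(raw):
    """Classify a raw query string into one of the three query modes."""
    raw = raw.strip()
    # Assumed convention: /pattern/ marks a grep-style regular expression.
    if len(raw) > 2 and raw.startswith("/") and raw.endswith("/"):
        pattern = raw[1:-1]
        re.compile(pattern)  # raises re.error if the pattern is invalid
        return Query("regex", pattern)
    # Programming-specific extension, e.g. "function: Foo".
    match = re.fullmatch(r"(\w+)\s*:\s*(\S.*)", raw)
    if match and match.group(1).lower() in KINDS:
        return Query("semantic", match.group(2), match.group(1).lower())
    # Anything else is treated as a plain keyword search.
    return Query("keyword", raw)


if __name__ == "__main__":
    for raw in ("/Foo[0-9]+/", "function: Foo", "list array"):
        print(repr(raw), "->", parse_query(raw))
```

An unknown qualifier such as "note: Foo" falls through to a keyword search in this sketch, which keeps ordinary text containing a colon searchable.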

The results returned from the searches will be ranked by relevance, taking into account source code semantics (class, method, variable, etc.), and will point to the original source code. The system will be easy to use, configurable, and responsive. CodeCrawler will support as many programming languages as possible, and will be easily extendable with support for new programming languages.
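One simple way to let source code semantics influence relevance is to weight each occurrence of the keyword by the kind of symbol it represents. The sketch below assumes an indexer has already produced (path, name, kind) tuples; the weights in KIND_WEIGHTS and the rank function are invented for illustration and do not describe CodeCrawler's actual scoring.

```python
from collections import defaultdict

# Hypothetical weights: a declaration counts for more than a local use.
KIND_WEIGHTS = {"class": 10.0, "function": 5.0, "variable": 1.0}
DEFAULT_WEIGHT = 0.5  # assumed weight for any other kind of occurrence


def rank(occurrences, keyword):
    """Rank files for a keyword.

    occurrences is an iterable of (path, name, kind) tuples, assumed to
    come from an indexer that has analyzed each source file.
    Returns a list of (path, score) pairs, highest score first.
    """
    scores = defaultdict(float)
    for path, name, kind in occurrences:
        if name == keyword:
            scores[path] += KIND_WEIGHTS.get(kind, DEFAULT_WEIGHT)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    index = [
        ("a.cpp", "Foo", "variable"),
        ("a.cpp", "Foo", "variable"),
        ("a.cpp", "Foo", "variable"),
        ("foo.h", "Foo", "class"),
    ]
    # foo.h declares the class once and still outranks a.cpp,
    # which only uses a local variable three times.
    print(rank(index, "Foo"))
```

A plain occurrence count would put a.cpp first here; the weighting is what moves the class declaration to the top.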

The users of CodeCrawler are expected to be software developers and programmers familiar with the syntax of at least one programming language and regular expressions. CodeCrawler requires an administrator to maintain the indexes and connections for accessing the source code.

What's New
CodeCrawler 2005 released: Introducing the New Code Detective for Developers.