Home / Expert Answers / Computer Science / project-descriptionthe-purpose-of-this-project-is-to-implement-one-of-the-most-commonly-used-index-s-pa213

(Solved): PROJECT DESCRIPTIONThe purpose of this project is to implement one of the most commonly used index s ...



PROJECT DESCRIPTION
The purpose of this project is to implement one of the most commonly used index structures in database
systems, B+ Tree. While a real B+ tree is usually implemented on database files, this project requires you
to implement an in-memory B+ tree to simulate a real system. The project is to parse a web page and
build a B+ tree to help you calculate and store the frequencies of words found in a given web page.
GETTING STARTED
The following classes are defined:
• WordFrequency class which holds the main method.
• BPlusTree class which represents a B+ index tree and stores all the words from a given url.
• BTNode class which defines a single node in a B+ tree. A BTNode object stores a variable number of
keywords (Let’s set the maximum number of keywords to 3). If a BTNode object represents a leaf
node, it should also record the frequency count of each word stored in the node.
• BTNodeInternal class which extends from BTNode and represents an internal node object.
• BTNodeLeaf class which extends from BTNode and represents a leaf node object.
Basically, you don’t need to make changes to WordFrequency, and you only need to have a few changes
to BPlusTree and BTNode depending on how you are going to pass parameters. The major work is
within BTNodeInternal and BTNodeLeaf.
Your program will take two command line arguments. The first argument is a URL that your program
should process. The second argument is an ignore list file containing words that should be ignored during
processing. The program will process the ignore list and use a HashSet object to store all these words
(The initial ignore list is provided to you in stopwords.txt. Feel free to use it as a starting point to
develop your own list). Next, the program should process the input .html file. In the give
WordFrequency class, the .html file has been processed and tokenized using jsoup library. If you don’t
like how it is used, feel free to make changes. Eventually, for all tokens in the input file but not in the
ignore list, you need to add them to a BPlusTree sorted alphabetically. Only unique words should be
added to the tree. When a duplicate word is encountered, just update the frequency count of the word.
After processing the input file, your program should enter a loop as shown below, which prompts the user
to select from one of the six options:
1. (7 points) Print out all keywords in the BPlusTree in alphabetical order. You must implement
this function by properly traversing the tree, and then print all the words in order.
2. (12 points) Display the B+ tree structure. This is the challenging part of this project. Basically,
you need to implement the BFS tree traversal algorithm applied to a B+ tree. You can decide how
you are going to display the tree. Here’s an example for your reference:

In this screenshot, the left-most node is the root. The key inside this node is manipulating. It has a left child node whic

3. (12 points) Insert a word. When you insert a word to an existing tree, make sure that (1) it does
not allow duplicate words to be inserted to the tree (you only increment the frequency count
instead), and (2) split the tree node when needed.
4. (7 points) Search a word. If the word exists, display its frequency count. This function must be
implemented by properly traversing through the tree.
5. (7 points) Range search. Ask user to enter two words as lower and upper boundaries, and list all
the words in between. Again, this function must be implemented by properly traversing through
the tree.
6. Quit.

jar file can be found online

jsoup-1.14.3.jar i

Here is some starter code..it all needs to be done in java and some of the code i did but i am not sure if it is correct..

BPlusTree.java

import java.util.*;

class BPlusTree
{
BTNode root;
private int nextNodeID; //An ID is given to a node; this is only for display purpose; we can easily see all the keys within the same node.
private final Set ignoreTable;

public BPlusTree(Set ignoreSet)
{
root = new BTNodeLeaf();
nextNodeID = 1;
ignoreTable = ignoreSet;
}

public int assignNodeID()
{
//Return current nextNodeID, then increment value
return nextNodeID++;
}

public Boolean insertWord(String word)
{
word = word.toLowerCase();

if(ignoreTable != null && ignoreTable.contains(word))
{
return false;
}

root.insert(word, this);

//Update root if tree grows
if(root.getParent() != null)
{
root = root.getParent();
}

return true;
}

public void printLeavesInSequence()
{
if(root != null)
{
root.printLeavesInSequence();
}
}

public void printStructureWKeys()
{
if(root != null)
{
root.printStructureWKeys();
}
}

public void rangeSearch(Scanner kb)
{
String startWord;
String endWord;
String swap;

System.out.print("Enter 1st word to search: ");

startWord = kb.nextLine();
startWord = startWord.toLowerCase();

System.out.print("Enter 2nd word to search: ");

endWord = kb.nextLine();
endWord = endWord.toLowerCase();

//Swap, if needed, to assign "start" and "end" words in correct order
if(startWord.compareTo(endWord) > 0)
{
swap = startWord;
startWord = endWord;
endWord = swap;
}

if(!root.rangeSearch(startWord, endWord))
{
System.out.println("No words found in that range.");
}
}

public void searchWord(Scanner kb)
{
String word;

System.out.print("Enter word to search: ");

word = kb.nextLine();

if(!root.searchWord(word.toLowerCase()))
{
System.out.println("Word \"" + word + "\" not found.");
}
}

public void userInsertWord(Scanner kb)
{
String word;

System.out.print("Enter word to insert: ");

word = kb.nextLine();

if(insertWord(word.toLowerCase()))
{
System.out.println("Word \"" + word + "\" inserted.");
}
else
{
System.out.println("Word \"" + word + "\" found in ignore list, not inserted.");
}
}
}

BTNode.java

import java.util.ArrayList;

abstract class BTNode
{
protected int nodeID;
protected ArrayList<String> keys;
protected BTNodeInternal parent;

public BTNodeInternal getParent() { return parent; }

public abstract void insert(String key, BPlusTree tree);

public abstract void printLeavesInSequence();

public abstract void printStructureWKeys();

public abstract Boolean rangeSearch(String startWord, String endWord);

public abstract Boolean searchWord(String word);
}

BTNodeInternal.java

import java.util.Collections;
import java.util.ArrayList;

class BTNodeInternal extends BTNode
{
ArrayList<BTNode> children;

public BTNodeInternal() {
children = new ArrayList<>();
}

public void insert(String key, BPlusTree tree)
{

}

public void printLeavesInSequence()
{

}

public void printStructureWKeys()
{

}

public Boolean rangeSearch(String startWord, String endWord)
{
return true;
}

public Boolean searchWord(String word)
{
return true;
}
}

BTNodeLeaf.java

import java.util.Collections;
import java.util.ArrayList;
import java.util.Set;

class BTNodeLeaf extends BTNode
{
ArrayList<Integer> keyCounts;
BTNodeLeaf nextLeaf;

public BTNodeLeaf()
{
keyCounts = new ArrayList<Integer>();
nextLeaf = new BTNodeLeaf();
}

public void insert(String word, BPlusTree tree)
{
int index = this.nextLeaf.nodeID;
Integer temp = tree.assignNodeID();
if(index != -1) {

this.keyCounts.remove(temp);
temp++;
this.keyCounts.add(index, temp);
}else {
if(this.keyCounts.isEmpty()) {
tree.root.keys.add(0, String.valueOf(temp));
this.keyCounts.add(0,1);
}else {
boolean inserted = false;
for(int i = 0; i < tree.root.keys.size(); i++) {
int compare = tree.root.keys.get(i).compareToIgnoreCase(word);
if(compare > 0) {
BPlusTree temp2 = new BPlusTree((Set) tree.root.getParent());
tree.root.keys.add(i, temp2.toString());
this.keyCounts.add(i, 1);
inserted = true;
break;
}
}
if(!inserted) {
BPlusTree temp2 = new BPlusTree((Set) tree.root.getParent());
tree.root.keys.add(tree.root.keys.size(), temp2.toString());
this.keyCounts.add(this.keyCounts.size(), 1);
}
}
}


}

public void printLeavesInSequence()
{

}

public void printStructureWKeys()
{

}

public Boolean rangeSearch(String startWord, String endWord)
{
return true;
}

public Boolean searchWord(String word)
{

}
}
WordFrequency.java

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.*;
import java.util.*;


class WordFrequency
{
public static void main(String[] args)
{
HashSet<String> ignoreSet = new HashSet<String> (); 
Document document = null;
String doc_text = "";
String[] tokenized_text = null;
BPlusTree tree;
Scanner fin;
Scanner kb;

//Check for valid arguments
if(args.length != 2)
{
System.out.println("ERROR: Incorrect format for command line arguments.");
System.out.println("java WordFrequency <URL> <ignore_file_name.txt>");
System.out.println("Exiting.");
System.exit(-1);
}

//Program will take time to set up tree; inform user
System.out.println("WordFrequency is processing HTML. Please wait...");

//Read stopwords ignoreSet
try
{
fin = new Scanner(new File(args[1]));
while(fin.hasNextLine())
{
ignoreSet.add(fin.nextLine());
}
fin.close();
}
catch(Exception e)
{
System.out.println("File IO Exception; initializing BPlusTree without stopwords.");
}

//Initialize BPlusTree
tree = new BPlusTree(ignoreSet);

/* Get HTML.
* Convert HTML to string.
* Tokenize string. */
try
{
document = Jsoup.connect(args[0]).get();
doc_text = document.text();
tokenized_text = doc_text.split("(\\W+)?\\s(\\W+)?");
}
catch(Exception e)
{
System.out.println(e.toString());
System.exit(-1);
}

for(String s: tokenized_text)
{
tree.insertWord(s);
}

document = null;
doc_text = null;
tokenized_text = null;

//Initialize keyboard scanner
kb = new Scanner(System.in);

mainMenu(tree);
}

private static void mainMenu(BPlusTree tree)
{
Scanner kb = new Scanner(System.in);
String userInput;
int menuSelect = 0;
Boolean validSelect = true;
Boolean quit = false;

do
{
System.out.println();

//Show menu
System.out.println("MAIN MENU");
System.out.println("1) Print all words in order.");
System.out.println("2) Display tree with Node IDs and keys.");
System.out.println("3) Insert a word.");
System.out.println("4) Search a word.");
System.out.println("5) Search by range.");
System.out.println("6) Quit.");

do
{
System.out.print("Select an option: ");

validSelect = true;

//Get user selection
userInput = kb.nextLine();

//Validate selection input
try
{
menuSelect = Integer.parseInt(userInput);

if(menuSelect < 1 || menuSelect > 6)
{
System.out.println("Invalid selection. Enter an integer from 1 to 6.");
validSelect = false;
}
}
catch(NumberFormatException nfe)
{
System.out.println("Invalid selection. Enter an integer from 1 to 6.");
validSelect = false;
}
}while(!validSelect);

System.out.println();

//Execute menu selection
switch(menuSelect)
{
case 1:
tree.printLeavesInSequence();
break;

case 2:
tree.printStructureWKeys();
break;

case 3:
tree.userInsertWord(kb);
break;

case 4:
tree.searchWord(kb);
break;

case 5:
tree.rangeSearch(kb);
break;

default:
quit = true;
break;
}

}while(!quit);

System.out.println("Quitting.");
}
}

stopwords.txt

a
about
all
also
although
an
and
any
are
as
at
back
be
been
but
by
do
does
few
for
from
had
have
has
he
her
here
him
his
how
i
if
in
into
is
it
its
many
me
my
none
not
of
on
or
our
seem
seems
she
so
some
than
then
that
the
their
them
there
they
that
this
to
too
us
was
we
were
what
when
where
which
who
why
will
with
you
your

Thank you!!!

In this screenshot, the left-most node is the root. The key inside this node is "manipulating". It has a left child node which includes keys "jsoup" and "list", and a right child node which includes key "online". You may wonder what those numbers , etc. are. They are the IDs assigned to the nodes, but they don't present any ordering. The reason an ID is assigned to a node is to make it clear which keys are within the same node. For example, in Node 10 , there are three keys including "implements", "invalid" and "jar". If you don't think this is necessary, you can simply ignore nextNodeID and assignNodeID() in BPlusTree.


We have an Answer from Expert

View Expert Answer

Expert Answer



The B+ Tree implementation consists of several Java classes:-

BPlusTree, BTNode, BTNodeInternal, and BTNodeLeaf. BPlusTree is the main class and handles user input and output, and calls the appropriate methods in the other classes to handle the various operations. BTNode is an abstract class that defines the basic structure and behavior of a node in the B+ Tree. BTNodeInternal represents an internal node in the B+ Tree, while BTNodeLeaf represents a leaf node in the B+ Tree.

I have checked it and found some errors and suggestions for improvement. Here they are:-


In line 1, you should use import java.util.*; instead of import java.util.*; to avoid importing unnecessary classes.
In line 13, you should declare the ignoreTable variable as private final Set instead of private final Set to specify the generic type of the set and avoid unchecked warnings.
In line 24, you should use return nextNodeID++; instead of return nextNodeID++; to increment the value after returning it.
In line 28, you should use if (ignoreTable != null && ignoreTable.contains(word)) instead of if(ignoreTable != null && ignoreTable.contains(word)) to follow the Java code conventions and improve readability.
In line 44, you should use if (root.getParent() != null) instead of if(root.getParent() != null) to follow the Java code conventions and improve readability.
In line 49, you should use if (root != null) instead of if(root != null) to follow the Java code conventions and improve readability.
In line 54, you should use if (root != null) instead of if(root != null) to follow the Java code conventions and improve readability.
In line 59, you should use String startWord; instead of String startWord; to follow the Java code conventions and improve readability.
In line 60, you should use String endWord; instead of String endWord; to follow the Java code conventions and improve readability.
In line 61, you should use String swap; instead of String swap; to follow the Java code conventions and improve readability.
In line 62, you should use System.out.print("Enter 1st word to search: "); instead of System.out.print("Enter 1st word to search: "); to follow the Java code conventions and improve readability.
In line 64, you should use startWord = kb.nextLine(); instead of startWord = kb.nextLine(); to follow the Java code conventions and improve readability.
In line 65, you should use startWord = startWord.toLowerCase(); instead of startWord = startWord.toLowerCase(); to follow the Java code conventions and improve readability.
In line 66, you should use System.out.print("Enter 2nd word to search: "); instead of System.out.print("Enter 2nd word to search: "); to follow the Java code conventions and improve readability.
In line 68, you should use endWord = kb.nextLine(); instead of endWord = kb.nextLine(); to follow the Java code conventions and improve readability.
In line 69, you should use endWord = endWord.toLowerCase(); instead of endWord = endWord.toLowerCase(); to follow the Java code conventions and improve readability.
In line 72, you should use swap = startWord; instead of swap = startWord; to follow the Java code conventions and improve readability.
In line 73, you should use startWord = endWord; instead of startWord = endWord; to follow the Java code conventions and improve readability.
In line 74, you should use endWord = swap; instead of endWord = swap; to follow the Java code conventions and improve readability.
In line 77, you should use if (!root.rangeSearch(startWord, endWord)) instead of if(!root.rangeSearch(startWord, endWord)) to follow the Java code conventions and improve readability.
In line 78, you should use System.out.println("No words found in that range."); instead of System.out.println("No words found in that range."); to follow the Java code conventions and improve readability.
In line 82, you should use String word; instead of String word; to follow the Java code conventions and improve readability.
In line 83, you should use System.out.print("Enter word to search: "); instead of
System.out.print("Enter word to search: ");
to follow the Java code conventions and improve readability.
In line 85, you should use word = kb.nextLine(); instead of word = kb.nextLine(); to follow the Java code conventions and improve readability.
In line 86, you should use ```if (!root.searchWord(word.toLowerCase





We have an Answer from Expert

Buy This Answer $5

Place Order

We Provide Services Across The Globe