Building a website of keywords
This is a short project to explore generating keywords and building web pages programmatically. The full project is on GitHub. It is an older project that I cleaned up a bit recently.
I first scan through the abstracts from the ACM Journals (stored in a .txt file), pulling out all words, ignoring stopwords, and correcting common misspellings.
    for i in abstracts:
        # replace noise characters
        abs_words = i.replace(';', ' ').replace(',', ' ').replace(':', ' ').replace('-', ' ').replace('.', ' ') \
            .replace('(', ' ').replace(')', ' ').replace('{', ' ').replace('}', ' ').replace('0', ' ').replace('1', ' ') \
            .replace('2', ' ').replace('3', ' ').replace('4', ' ').replace('5', ' ').replace('6', ' ').replace('7', ' ') \
            .replace('8', ' ').replace('9', ' ').replace("'", ' ').replace('"', ' ').replace(']', ' ').replace('[', ' ') \
            .replace('“', ' ').replace('”', ' ').replace('?', ' ').replace('=', ' ').replace('&', ' ').split()

        for word in abs_words:
            word = word.lower()
            # if word is in misspellings list, replace with correct spelling
            for key in corrections:
                word = word.replace(key, corrections[key])
            # check word is not in stopwords list
            if word not in stopwords_list:
                keywords.append(word)
This list of words is then counted and sorted, keeping only the top 35 keywords.
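The counting step is not shown in the snippet above. A minimal sketch of it, assuming `keywords` is the flat list built by the previous loop and using `collections.Counter` from the standard library. The helper name `top_keywords_from` and the sample words are made up for illustration, and the original project may count differently:

```python
from collections import Counter

def top_keywords_from(keywords, n=35):
    """Return the n most frequent words, most common first."""
    counts = Counter(keywords)
    # most_common(n) returns (word, count) pairs sorted by count, descending
    return [word for word, _ in counts.most_common(n)]

# hypothetical sample data standing in for the real keyword list
sample = ["graph", "model", "graph", "query", "model", "graph"]
top_keywords = top_keywords_from(sample, n=2)
print(top_keywords)  # ['graph', 'model']
```

With the real data, `top_keywords_from(keywords)` would give the top 35 words used to build the keyword pages.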
The abstracts are in one big text file, so I parse them into a list with each abstract in its own list item.
    # until we reach the end of the file
    while aline != '':
        # create a (new) blank list for the article
        article = []
        # until we reach the blank line that's between each article
        # (also stop at end of file, where readline() returns '',
        # otherwise a file without a trailing blank line loops forever)
        while aline != '\n' and aline != '':
            aline = aline.rstrip('\n')
            # append to list called article
            article.append(aline)
            # read the next line
            aline = abstracts.readline()
        # get the whole abstract together
        new_entry = ' '.join(article)
        # append the individual entries to the overall list
        abstract_list.append(new_entry)
        # read in the next line
        aline = abstracts.readline()
Next, I build a page for each keyword that lists the abstracts containing it, along with a page for each matching abstract.
    # Building pages for each keyword
    # take each keyword
    for word in top_keywords:
        indices = []
        count = 0
        # compare each keyword to the abstracts
        for listing in abstract_list:
            articleholder = []
            # if keyword is in the abstract, append to a list
            # of articles for that keyword
            if word in str(listing):
                # then split on the " that surround the titles
                articleholder = listing.split('"')
                # take the second part of the split (title) and add link
                # use rstrip to remove ending commas
                link = '<a href="article' + str(count) + '.html">' + \
                    str(articleholder[1]).rstrip(',') + '</a>'
                # append the first part of the split (author) and second part (title)
                # to create a listing
                # use rstrip to remove ending commas
                key_listing = str(articleholder[0]).rstrip(', ') + '<br>' + link
                # append this to a list for the keyword page
                indices.append(key_listing)
                # create page for the abstract with line breaks
                # use rstrip to remove ending commas
                abs_page = '<u> Abstract</u>' + '<br> <br>' + \
                    '<i>' + str(articleholder[0]).rstrip(', ') + '</i>' + '<br>' + \
                    '<b>' + str(articleholder[1]).rstrip(',') + '</b>' + '<br> <br>' + \
                    str(articleholder[2])
                # write out HTML for that article
                filename = 'article' + str(count) + '.html'
                with open(filename, "w") as keyword_file:
                    keyword_file.write(abs_page)
            # count tracks the abstract's position in abstract_list, so it
            # advances for every abstract and each one keeps the same filename
            count += 1
        # build keyword abstract listing page,
        # including line breaks between abstracts
        page = '\n \n <br><br> \n'.join(indices)
        # write out the keyword page
        filename = word + '.html'
        with open(filename, "w") as keyword_file:
            keyword_file.write(page)
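One fragile spot in the loop above is the split on `"`: if an abstract itself contains quotation marks, the body ends up spread over several list items and only the first piece is written out. A small sketch of a sturdier split, assuming each entry has the form `Author, "Title," abstract text` as the comments in the code suggest. The helper name `parse_listing` and the sample entry are hypothetical, and escaping the text with `html.escape` is an addition of mine, not something the original code does:

```python
import html

def parse_listing(listing):
    """Split 'Author, "Title," abstract text' into its three parts."""
    # maxsplit=2 keeps any later quotation marks inside the abstract body
    parts = listing.split('"', 2)
    if len(parts) < 3:
        raise ValueError("expected a quoted title in the listing")
    author = parts[0].rstrip(', ')
    title = parts[1].rstrip(',')
    body = parts[2].strip()
    # escape &, <, > and quotes so the text cannot break the generated HTML
    return html.escape(author), html.escape(title), html.escape(body)

# hypothetical entry, not taken from the real data file
author, title, body = parse_listing(
    'A. Author, "A Sample Title," We study "quoted" terms & more.'
)
print(author)
print(title)
print(body)
```

The three returned strings could then be dropped into the `key_listing` and `abs_page` templates in place of the `articleholder[...]` lookups.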
Finally, I bring it all together with a simple index page.
    # build HTML document
    html_begin = """
    <!DOCTYPE html>
    <html>
    <body>
    <h1>Welcome to the ACM Library</h1>
    <h3>A research, discovery and networking platform</h3>
    <h2>Browse our library by keyword.</h2>
    <h3 id="keyword">Keywords</h3>
    <ul>
    """

    html_end = """
    </ul>
    </body>
    </html>"""

    # build HTML doc
    html_str = html_begin + li_list + html_end

    # write out HTML
    with open("index.html", "w") as html_file:
        html_file.write(html_str)

    abstracts.close()
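The snippet above uses `li_list` without showing where it comes from. One way it could be built, assuming each keyword page is saved as `<keyword>.html` (as in the page-building loop) and that `top_keywords` holds the keywords. The helper name `build_li_list` is made up for this sketch:

```python
def build_li_list(top_keywords):
    """Return one <li> link per keyword, pointing at '<keyword>.html'."""
    items = [
        '<li><a href="{0}.html">{0}</a></li>'.format(word)
        for word in top_keywords
    ]
    # one list item per line keeps the generated index readable
    return "\n".join(items)

# hypothetical keywords standing in for the real top 35
li_list = build_li_list(["graph", "model"])
print(li_list)
```

The resulting string slots straight in between `html_begin` and `html_end`.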
Next Steps
If I continue this project, I’d like to explore creating directories for the keyword pages and the abstract pages (only if the directories don’t already exist). I could also come back to improve the HTML or add some CSS to make the pages look better.
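A rough sketch of the directory idea, using `os.makedirs` with `exist_ok=True` so the directory is created only when it is missing. The helper name `write_page` and the directory names in the usage note are placeholders, not part of the current project:

```python
import os

def write_page(directory, filename, content):
    """Write content to directory/filename, creating the directory if needed."""
    # exist_ok=True makes this a no-op when the directory is already there
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as page_file:
        page_file.write(content)
    return path
```

The keyword loop could then call something like `write_page("keywords", word + ".html", page)` and `write_page("articles", filename, abs_page)`, with the links in the generated HTML updated to match the new paths.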
I would also go back and break up main() into a number of smaller functions.