The Technicalities of AI-Powered Business Name Generation

Full Project File: https://github.com/jakeww/nameGeneratorAI

Introduction

The objective of this project is to create an AI-powered business name generator using public data from the state of Oregon. This document provides a detailed overview of the process and methodology employed in achieving this goal.

Data Collection

The initial plan was to download LLC data from the state of Colorado, but no comprehensive registry was found. Further research identified an alternative source: business data from the state of Oregon. The data came as a 400MB file, which was too large to open in conventional software such as Apple's Numbers or Google Sheets. An API call was therefore used to retrieve only the business names. The call returned over 2,000,000 business names, which was deemed sufficient for this project. Because of problems in the data, such as special characters and repeated entries, the final count was reduced to 1,436,742.
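The write-up does not describe the API call itself, so it is not reproduced here. As a rough illustration of the same idea (keeping only the name field and dropping repeats without loading the whole file into memory), the following sketch streams CSV rows one at a time. The column name "Business Name", the file name in the usage comment, and the function name unique_names are all assumptions made for this example, not details taken from the project.

```python
import csv
from typing import Iterable, Iterator


def unique_names(lines: Iterable[str], column: str = "Business Name") -> Iterator[str]:
    """Yield each distinct, non-empty value of `column`, one row at a time.

    `column` is a hypothetical header name; adjust it to the real export.
    """
    seen = set()
    for row in csv.DictReader(lines):
        name = (row.get(column) or "").strip()
        key = name.casefold()  # treat "Blue Sky" and "blue sky" as repeats
        if name and key not in seen:
            seen.add(key)
            yield name


# Hypothetical usage with a local export (the file name is an assumption):
# with open("oregon_businesses.csv", newline="", encoding="utf-8") as f:
#     for name in unique_names(f):
#         print(name)
```

Because the function is a generator over an open file, only one row plus the set of names already seen is held in memory at any time, which is what makes a file of this size manageable.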

Data Pre-Processing

The collected names were pre-processed before being used in the project. This step is crucial because it puts the data into a consistent format that the AI model can use easily.

The steps taken for pre-processing the data were as follows:

Set all alphabetic characters to lowercase - This step ensures that all the data is in a consistent format and that the model is not affected by variations in case. It also reduces the number of unique words the model has to handle, since "Coffee" and "coffee" become the same word.
Strip all special characters, such as commas, hyphens, apostrophes, etc. - This step removes unwanted characters that may affect the performance of the model. Such characters could cause the model to interpret the data incorrectly or lead to errors during the training process.
Tokenize data by splitting it into individual words - This step splits each name into individual words, or tokens, which are the units the model works with. Working with tokens lets the model learn which words tend to appear together, and splitting on whitespace also discards any unnecessary spaces.
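The three steps above can be sketched as a single function. This is a minimal illustration rather than the project's actual code: the function name preprocess is hypothetical, and the sketch assumes that "special characters" means anything other than the letters a-z, digits, and whitespace, so accented letters would also be dropped.

```python
import re


def preprocess(name: str) -> list[str]:
    """Lowercase a business name, strip special characters, and tokenize it."""
    lowered = name.lower()
    # Assumption: keep only a-z, 0-9 and whitespace; everything else is removed.
    cleaned = re.sub(r"[^a-z0-9\s]", "", lowered)
    # Splitting on whitespace also discards repeated or trailing spaces.
    return cleaned.split()


print(preprocess("Joe's Coffee-House, LLC"))
# ['joes', 'coffeehouse', 'llc']
```

Note that removing hyphens and apostrophes outright joins the surrounding letters ("coffee-house" becomes "coffeehouse"). Replacing them with a space instead would keep the parts as separate tokens; which behavior is preferable depends on the names being modeled.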

Model Implementation

A Markov Chain algorithm was employed to generate names. A Markov chain is a probabilistic model in which the next item in a sequence depends only on the current state, and it is widely used in natural language processing, speech recognition, and similar tasks. The results were not perfect: some of the generated names do not make sense or are not suitable for a business. However, the algorithm was deemed suitable for the first version of the AI business name generator because it provides a solid foundation for further improvements. With further development and fine-tuning, it could generate more plausible and suitable business names.
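To make the idea concrete, here is a minimal sketch of a Markov chain name generator. It assumes a first-order, word-level chain (each word is chosen based only on the word before it), which the write-up does not specify. The names build_chain and generate_name and the start and end markers are hypothetical, chosen only for this example.

```python
import random
from collections import defaultdict

# Hypothetical markers for the beginning and end of a name.
START, END = "<s>", "</s>"


def build_chain(tokenized_names):
    """Map each word to the list of words observed directly after it."""
    chain = defaultdict(list)
    for tokens in tokenized_names:
        path = [START, *tokens, END]
        for current, following in zip(path, path[1:]):
            chain[current].append(following)
    return chain


def generate_name(chain, rng, max_words=5):
    """Walk the chain from START until END is drawn or max_words is reached."""
    word, words = START, []
    while len(words) < max_words:
        # Duplicates in the list make frequent transitions more likely.
        word = rng.choice(chain[word])
        if word == END:
            break
        words.append(word)
    return " ".join(words)


names = [["blue", "sky", "llc"], ["blue", "river", "coffee"]]
chain = build_chain(names)
print(generate_name(chain, random.Random()))
```

With only two training names the generator can do little more than reproduce its input. Variety comes from words that appear in many names, since each shared word is a point where the walk can switch from one name's continuation to another's. This is also why some outputs will not make sense: the chain has no memory beyond the current word.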

Conclusion

This project demonstrates the feasibility of using AI to generate business names from public data from the state of Oregon. The work consisted of three stages: obtaining the data, pre-processing it, and modeling it with a Markov Chain algorithm. The data was collected from a publicly available source and then pre-processed into a format that the AI model could use easily. The Markov Chain algorithm was then used to generate possible business names from the patterns and relationships between the words in the pre-processed data.

Although the results generated by the Markov Chain algorithm were not perfect, the algorithm is still deemed suitable for the first version of the AI business name generator. The generated names provide a good starting point for further development, and the algorithm could be fine-tuned to generate more plausible and suitable business names.

Future Outlook

The plan for future development is to further analyze the results generated by the Markov Chain algorithm using GPT-2. GPT-2 is a large language model capable of generating human-like text. It could be used to improve the relevance of the generated business names by evaluating the patterns and relationships between the words in each generated name. GPT-2 may also help to ensure that the generated names are grammatically correct and in a format suitable for use as a business name.