Analysing 260,000 Text Documents

An end-to-end NLP project to find trends and discover topics.

Gunardi Ali
11 min read · Apr 28, 2023

Article 1/5: Collecting 260,000 Text Documents.

Photo by Annie Spratt on Unsplash

Content:

A. Background and a use case

B. Scope of this article

C. Theory

C.1. Public data hosted on Google BigQuery

C.2. Storing relevant data locally using SQLite

D. Code and Explanation

My motivation for writing this series is to help people who are one or two steps behind me in learning Machine Learning, and I write it so that you can click along. I am a self-taught Python hobby programmer and have been studying data science since summer 2022. If something is unclear or you have suggestions, feel free to contact me here.

A. Background and a use case

I finished an end-to-end NLP (Natural Language Processing) project that analysed 264,148 text documents collected from StackOverflow. I applied exploratory data analysis and an unsupervised Machine Learning (ML) algorithm called Latent Dirichlet Allocation (LDA), which is used to discover recurring topics in the ~260k documents. My use case for this project was to identify the trend of security-related topics in configurable software, using discussions (text data) from StackOverflow. You can read the complete PDF report about this project here and my freshly made website here. The topic clustering visualisation can be viewed here.

Methodological Overview of This NLP-Project

As can be seen above, the machine learning workflow (LDA) starts just before the last step (visualisation). A confession to make: at least 80% of my coding effort was not related to machine learning, but rather to collecting, preparing and transforming data. Decisions have to be made about how to manage, store and prepare the data so that it is usable and compatible as input for the ML workflow. Unfortunately, these crucial steps are under-emphasised in many publications.

B. Scope of this article

In this article, I describe only the “Data Collection” step (refer to the screenshot above).

What you will learn from this article:

  • Collecting public data hosted on Google BigQuery,
  • Filtering on BigQuery to keep only the relevant data,
  • Storing text data locally using SQLite and the Python programming language.

C. Theory

C.1. Public data hosted on Google BigQuery

BigQuery is a data warehouse solution from Google. It has its own SQL dialect and also supports standard SQL, so if you have experience with SQL, you shouldn’t have any problem using BigQuery. If not, you can learn it now: SQL is essential for data scientists and, above all, easy to learn. You could start by following along with this article. To collect public data from BigQuery and transform it, you just need to register with Google. It is free and requires no credit card.

Skip to 2:20 to see BigQuery in action.

Many public datasets are hosted on BigQuery. The one I used for this project is a text dataset of Q&A discussions from StackOverflow.

C.2. Storing relevant data locally using SQLite

After collecting, filtering and transforming the public data on BigQuery, it is necessary to store it locally. Strictly speaking, this is optional for you, because you can get all working code from this article series. In practice, however, a project involves a lot of trial and error, so it is simply more comfortable and quicker to read the data from your local machine. If you don’t have enough storage capacity, it is enough to store just the first 1,000 rows instead of all the filtered data. For this project, I downloaded the relevant StackOverflow data (3.4 gigabytes) locally using SQLite. To view a SQLite database, you need to install DB Browser for SQLite.

DB Browser for SQLite
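
If you have never used SQLite from Python before, the following minimal sketch shows the basic idea: the built-in sqlite3 module creates a database file, and you write and read tables with plain SQL. The file and table names here are made up purely for illustration; the actual classes I use in this project follow in section D.

import sqlite3

# Open (or create) a local database file; the name is just an example.
connection = sqlite3.connect("example.sqlite")
cursor = connection.cursor()

# Create a small table and insert one row.
cursor.execute("CREATE TABLE IF NOT EXISTS posts (id INTEGER, body TEXT)")
cursor.execute("INSERT INTO posts VALUES (?, ?)", (1, "How do I secure my config?"))
connection.commit()

# Read the row back and print it.
print(cursor.execute("SELECT * FROM posts").fetchall())
connection.close()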

C.3. Some side notes

  • For this article series, I used the Python programming language intensively, mostly for preprocessing the data and applying the ML algorithm for topic discovery.
  • I used SQL at an early stage of this project to collect the data. At the very last stage, I also used it for the “Data Exploration” step, because by then it was less of a hassle and all the heavy-lifting Python ML tasks were already finished.

D. Code and Explanation

D.1. Open BigQuery Console

  • Open the BigQuery website,
  • Click the “Sign in” button (top right corner) and use your Google account,
  • After signing in, click “Console” (top right corner).

D.2. Create New Project and Open Its Dashboard

  • Now create a “New Project” by following this video and name it something like “ProjectStackOverflow”:
  • Next click “Dashboard”:
  • Select “BigQuery” under Resources section.

D.3. Open Public Datasets on BigQuery

  • Enter “bigquery public datasets” into the search bar at the top and click the following:
  • Look for the following button and click it:
  • Now you can find all public datasets hosted on BigQuery:
  • You might want to bookmark this page for your next projects.

D.4. Open StackOverflow Dataset

  • Search for “Stack Overflow” and click “View Dataset”.
  • You might want to mark “stackoverflow” as a favourite:
  • I suggest you do the same for “bigquery-public-data” as well, to make the public data visible for your next project:
  • Now click “posts_questions”:
  • Under “Query”, click “In new tab”.
  • Afterwards, the following tab can be seen:

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Note:

For starters, you can run the following SQL query and check the result. With a public dataset like this one, you are well equipped to learn SQL.

SELECT *
FROM `bigquery-public-data.stackoverflow.posts_questions`
LIMIT 1000

You can also inspect the posts_questions table directly by opening “PREVIEW”:

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

D.5. SQL Operation on BigQuery

This is the SQL code:

SELECT *
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE (body LIKE "%secur%" OR body LIKE "%protect%") AND
(body LIKE "%custom%" OR body LIKE "%config%" OR
body LIKE "%variab%" OR body LIKE "%featur%" OR
body LIKE "%plug%" OR body LIKE "%system_family%" OR
body LIKE "%product_family%" OR body LIKE "%software_family%")
ORDER BY creation_date DESC

Explanation of the preceding query:

  1. SELECT *: show me the data from all (*) columns of …
  2. FROM '...' : … the table posts_questions of the public StackOverflow dataset …
  3. WHERE: … subject to the following conditions, e.g. the column body must contain the word “secur” or “protect”, plus at least one of the customisation-related terms.
  4. ORDER BY creation_date DESC: sort all rows in descending order by their creation_date.

After running the query, you can save the result as a “BigQuery table”:

Afterwards, you need to:

  1. Select a dataset, or create one if you haven’t already. In the screenshot below, I selected an existing dataset.
  2. Give the new table a name. In this example, I named it posts_questions_filtered. I suggest you use the same name as in the screenshot below, so that you can follow my next articles without having to remember which name you used.

Now you can find the newly exported table in the left sidebar.

D.6. Generate and Download a Google Cloud Service Account Key (JSON File)

A Google service account key is necessary to download data programmatically. Follow the video below (or this link) to generate a service account. Choose the role “Owner” (check 1:50) and download the JSON file (2:55). Be careful not to accidentally share the JSON file (e.g. through Git version control).
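
To check that the downloaded key works before moving on, you can run a trivial query with it. This is only a quick sketch: it assumes you have installed the google-cloud-bigquery package and saved the JSON file under the path shown below (adjust the path to your own location).

from google.oauth2 import service_account
from google.cloud import bigquery

# Path to the service account key from step D.6 (example path, adjust as needed).
PATH_GOOGLE_CREDENTIAL = "../Credentials/service-account-file.json"

credentials = service_account.Credentials.from_service_account_file(PATH_GOOGLE_CREDENTIAL)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# A trivial query: if this prints a row, the key and the client setup are working.
print(list(client.query("SELECT 1 AS ok").result()))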

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

In the following sections, we start importing the BigQuery table into a local SQLite database. For this, we need the following Python code as the backbone:

  1. Python code to download the BigQuery table (D.7),
  2. Python code to store the downloaded table in a SQLite database (D.8),
  3. Python code to combine the preceding two operations (D.9).

I explain the necessary Python code at a high level, since (I counted) there are more than 2,500 lines of code. I therefore link to all the necessary code here and explain only the important parts. If you have questions, don’t hesitate to contact me: here.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

D.7. Import BigQuery Table to Local SQLite

Complete code: here.

...
class BigqueryOperation():
    def __init__(self, PATH_GOOGLE_CREDENTIAL:str) -> None:
        # Read credential.
        self._credentials = service_account.Credentials.from_service_account_file(
            PATH_GOOGLE_CREDENTIAL)
        # Construct a BigQuery client object.
        self._client = bigquery.Client(credentials=self._credentials)
        self.last_request_schema = []

    def query_request(self, QUERY:str) -> pd.DataFrame:
        # Make an API request.
        query_job = self._client.query(QUERY)
        query_result = query_job.result()
        # Get the schema and save it as self.last_request_schema.
        self.last_request_schema = []
        for i in range(len(query_job.result().schema)):
            column_dict = query_job.result().schema[i].to_api_repr()
            column_str = " ".join(column_dict.values())
            self.last_request_schema.append(column_str)
        # Create an iterable of pandas DataFrames, to process the table as a
        # stream by using: .to_dataframe_iterable().
        df_iterable = query_result.to_dataframe_iterable()
        return df_iterable

Description of the Python code from the preceding link:

  1. It contains a Python class to fetch a specific table from BigQuery.
  2. When initialising the class, you need to pass the file path of the Google account key generated in step D.6 as the variable PATH_GOOGLE_CREDENTIAL.
  3. When executing query_request, it needs a BigQuery query as the QUERY input. Your FROM statement might differ depending on the names you chose (refer to this screenshot) in step D.5.

if __name__ == "__main__":
    PATH_GOOGLE_CREDENTIAL = "../Credentials/service-account-file.json"
    QUERY = """
    SELECT *
    FROM `projectstackoverflow.dataset_stackoverflow.posts_questions_filtered`
    ORDER BY creation_date DESC
    LIMIT 100
    """
    ...
    bigquery_object = BigqueryOperation(PATH_GOOGLE_CREDENTIAL)

Worth mentioning:

df_iterable = query_result.to_dataframe_iterable()

The preceding code is essential, especially when working with big data. It is like saying:

Hey, get me one billion rows of data. But please don’t send them all at once. Instead, send me the next chunk of rows each time I have finished with the previous one.

Before continuing to the next part, please test whether the code runs on your computer. Always divide and conquer: split your big project into smaller chunks, test them one by one, and only then combine and test them together!
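
For example, a quick way to test the class is to request the stream and look only at its first chunk. This is just a sketch that reuses the names from the __main__ block above.

# Quick smoke test for BigqueryOperation: fetch the stream and inspect the first chunk.
df_iterable = bigquery_object.query_request(QUERY)
first_chunk = next(iter(df_iterable))           # one pandas DataFrame, not the full result
print(first_chunk.shape)                        # number of rows and columns in this chunk
print(bigquery_object.last_request_schema[:3])  # first few column definitions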

D.8. Store Data into Local SQLite Database

Complete code: here. It contains a Python class for all SQLite operations. At this point, you should download a SQLite viewer.

When working with a DB, most of the time you either want to write a table into the DB or read a table from it. To write a table, you need to execute the following steps:

...
sqlite_handler = SqliteOperation(PATH_SQLITE)
sqlite_handler.create_table(NAME_TABLE, df_schema)
...
for i in df_iterable:
    ...
    sqlite_handler.insert_table_from_df(NAME_TABLE, i)
...

  1. Execute the create_table method. This step is only necessary when creating a new table; if you already have a table for your data, you can skip it. The method requires two inputs: a) the name of the new table, and b) the table schema stored in a list. The table schema specifies the data type of each column (string, integer, etc.). In the next section, we simply copy the schema from BigQuery instead of writing it by hand. (A simplified sketch of such a class follows after the next code block.)
  2. The for statement in the code block above iterates over the chunks of data (each one a pandas DataFrame) and executes the insert_table_from_df method. This method is also optimised: it “temporarily saves” 100 rows of data in memory and only then stores them in the DB by calling commit(). (Check the if statement in the next code block.) If it instead committed after every single row, storing the data would take far too long.
def insert_table_from_df(self, name, df_to_insert:pd.DataFrame) -> None:
    # Wrapper around the insert_table method for DataFrame input.
    count = 0
    commit_ = False
    df_to_insert = self.repair_df_format(df_to_insert)
    label = list(df_to_insert.columns)
    for i in df_to_insert.itertuples(index=False):
        self.insert_table(name, label, list(i), commit_)
        # Divide and conquer the committing: flag a commit roughly every 100 rows.
        count += 1
        if count == 100:
            commit_, count = True, 0
        else:
            commit_ = False
    # Final commit for the remaining rows.
    self._connect.commit()
    # print("*** Query commited successfully.")
    return
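
The linked file also contains the other methods used above (create_table, insert_table, repair_df_format), which I don’t walk through here. As an orientation, the stripped-down sketch below shows roughly how such a helper class can be built on top of the built-in sqlite3 module. It is my illustration of the idea, not a copy of the linked code, so the real method bodies differ.

import sqlite3

class SqliteOperationSketch:
    # Illustrative skeleton of a SQLite helper class (not the full linked code).

    def __init__(self, path_sqlite: str) -> None:
        self._connect = sqlite3.connect(path_sqlite)
        self._cursor = self._connect.cursor()

    def create_table(self, name: str, schema: list) -> None:
        # schema is a list of column-definition strings, e.g. as collected in D.7.
        columns = ", ".join(schema)
        self._cursor.execute(f"CREATE TABLE IF NOT EXISTS {name} ({columns})")
        self._connect.commit()

    def insert_table(self, name: str, label: list, values: list, commit: bool) -> None:
        # Insert one row; only commit when the caller asks for it (see above).
        placeholders = ", ".join("?" for _ in values)
        self._cursor.execute(
            f"INSERT INTO {name} ({', '.join(label)}) VALUES ({placeholders})", values
        )
        if commit:
            self._connect.commit()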

D.9. Combine Preceding 2 Operations (D.7 & D.8)

Complete code: here.

from a_requesting_bigquery import BigqueryOperation
from b_sqlite_operation import SqliteOperation
import time

PATH_GOOGLE_CREDENTIAL = "../Credentials/service-account-file.json"
QUERY = """
SELECT *
FROM `projectstackoverflow.dataset_stackoverflow.posts_questions_filtered`
ORDER BY creation_date DESC
"""
NAME_TABLE = "raw_datas"
PATH_SQLITE = "./DB/StackOverflow_Medium.sqlite"

bigquery_handler = BigqueryOperation(PATH_GOOGLE_CREDENTIAL)
df_iterable = bigquery_handler.query_request(QUERY)
df_schema = bigquery_handler.last_request_schema

sqlite_handler = SqliteOperation(PATH_SQLITE)
sqlite_handler.create_table(NAME_TABLE, df_schema)

count = 0
total_row = 0
for i in df_iterable:
    total_row += len(i)
    print("{}. Total row so far: {}.".format(str(count), str(total_row)))
    sqlite_handler.insert_table_from_df(NAME_TABLE, i)
    del i
    count += 1
    time.sleep(3)
print("Finished")

Before executing the code from the previous link, the following variables need to be updated: PATH_GOOGLE_CREDENTIAL, QUERY, NAME_TABLE, PATH_SQLITE. Your FROM statement for the variable QUERY might differ depending on the names you chose (refer to this screenshot) in step D.5. If you did exactly as described in this article, you can simply copy-paste the last code block and save it under the name c_bigquery_sqlite.py.
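
Once the script has run through, a quick sanity check is to count the rows that actually arrived in the local database. A minimal sketch, using the table name and path defined in the code above:

import sqlite3

# Count the rows imported by c_bigquery_sqlite.py.
connection = sqlite3.connect("./DB/StackOverflow_Medium.sqlite")
row_count = connection.execute("SELECT COUNT(*) FROM raw_datas").fetchone()[0]
print("Rows stored locally:", row_count)
connection.close()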

Quiz:

If you want to do this project with a smaller amount of data (instead of all available data), you can modify the QUERY to download only data from the years 2017–2022. Google (or nowadays perhaps ChatGPT) is your best friend if you don’t know how to solve a programming problem. I will reveal the answer in the next article.

I hope you can benefit from this tutorial. If something is unclear or you have suggestions, feel free to contact me here or use the comment section.

If you read this far and liked it, please clap for this article. Don’t forget to follow me if you don’t want to miss my next article.

