Training of ML Model

Records are classified manually to train the ML model so that its accuracy improves with every scan. The analyst goes through every record and verifies whether the machine has classified it correctly; if not, the analyst updates it manually. This process is repeated after every scan of any share for any client.

The steps to complete this process are described below:

Method 1: Manual Method

Step 1:

After the scan, you will be provided with a CSV file.
The file will look something like this:

Step 2: (Optional Step)

Open a blank Excel workbook and import the CSV into it. You can do this by selecting ‘Get data from text/csv’ in the ‘Data’ menu.
In the dialog box, choose UTF-8 as the File Origin so that all characters are displayed correctly. Note: We have clients all over the world, so we receive documents in many languages.

After loading, the Excel sheet will look as below:

Save the new file as an Excel workbook.
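As a cross-check outside Excel, the same UTF-8 import can be sketched in Python. This is a minimal sketch, not part of the official process: the function name load_scan_csv is hypothetical, and it assumes the scan CSV has a header row.

```python
import csv


def load_scan_csv(path):
    """Read the scan CSV into a list of dicts, decoding it as UTF-8."""
    # "utf-8-sig" decodes UTF-8 and also drops a leading byte-order mark
    # if one is present, so the first header name stays clean.
    # newline="" is the setting the csv module expects for files it reads.
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return list(csv.DictReader(handle))
```

If file names with accented or non-Latin characters come back intact from this function, the file is valid UTF-8 and should display correctly in Excel with the same File Origin.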

Step 3: (Optional Step)

Remove these columns from the Excel sheet:
- File Length
- Last Moved/Created Date
- Last Modified Date
- Category Confidence
- Subcategory Confidence
- Pii Confidence
- Classification
- Success Code
- MD5 (matching values are duplicates)
After the columns are removed, the Excel sheet will look like this:
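For analysts who prefer a script, the column removal can be sketched as follows. It assumes the rows have been loaded as a list of dicts keyed by header name and that the CSV headers match the names listed above exactly; drop_columns is a hypothetical helper name.

```python
# Column names copied from the list in Step 3; the exact header
# spelling in the real CSV is an assumption.
COLUMNS_TO_DROP = {
    "File Length",
    "Last Moved/Created Date",
    "Last Modified Date",
    "Category Confidence",
    "Subcategory Confidence",
    "Pii Confidence",
    "Classification",
    "Success Code",
    "MD5",
}


def drop_columns(rows, unwanted=COLUMNS_TO_DROP):
    """Return new rows without the unwanted columns; the input is not modified."""
    return [
        {name: value for name, value in row.items() if name not in unwanted}
        for row in rows
    ]
```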

Step 4:

Enable the filter by selecting ‘Filter’ in the ‘Data’ menu.
Using the FileType column, filter out all file types except for these:
- CSV
- DOC/DOCX/DOCM (any other Word document variation)
- PDF
- PPT/PPTX (any other PowerPoint variation)
- XLS/XLSB/XLSM/XLSX (any other Excel variation)
After the extra file types are removed, only records of the file types listed above remain.
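The same filter can be sketched in Python. This sketch assumes the FileType column holds a bare extension such as pdf or .DOCX (how the scan actually writes the value is not confirmed here), and it only covers the extensions named in the list; other Office variations would need to be added to the set. keep_reviewable is a hypothetical helper name.

```python
# Extensions named in Step 4; extend this set for other Office variations.
KEPT_FILE_TYPES = {
    "csv",
    "doc", "docx", "docm",
    "pdf",
    "ppt", "pptx",
    "xls", "xlsb", "xlsm", "xlsx",
}


def keep_reviewable(rows, column="FileType"):
    """Keep only the rows whose file type is in KEPT_FILE_TYPES."""

    def normalise(value):
        # Compare case-insensitively and ignore a leading dot or stray spaces.
        return (value or "").strip().lstrip(".").lower()

    return [row for row in rows if normalise(row.get(column)) in KEPT_FILE_TYPES]
```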

Step 5:

Add the columns mentioned below to the Excel sheet:
- Department/MainShare
- File Checked Y/N
- File Value changed Y/N
- Validate Date
- Previewed (Y/N)
Once the columns are added, arrange the columns in the following order:
- Department/MainShare
- Path
- Category Name
- SubCategory
- File Checked Y/N
- File Value changed Y/N
- Validate Date
- Previewed (Y/N)
- Contains PII
- FileType
The Excel sheet will now look like this:
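Adding the analyst columns and putting everything in the order above can be sketched like this. The sketch assumes rows are dicts keyed by header name; to_review_layout is a hypothetical helper name. Because Step 3 is optional, any column that is not in the list is kept and placed after the ordered ones instead of being discarded.

```python
# Column order copied from the list in Step 5.
FINAL_COLUMNS = [
    "Department/MainShare",
    "Path",
    "Category Name",
    "SubCategory",
    "File Checked Y/N",
    "File Value changed Y/N",
    "Validate Date",
    "Previewed (Y/N)",
    "Contains PII",
    "FileType",
]


def to_review_layout(rows):
    """Return rows with the Step 5 columns first, in order."""
    laid_out = []
    for row in rows:
        # Columns the scan did not provide (the analyst columns) start empty.
        ordered = {name: row.get(name, "") for name in FINAL_COLUMNS}
        # Any other column is kept after the ordered ones.
        ordered.update({k: v for k, v in row.items() if k not in ordered})
        laid_out.append(ordered)
    return laid_out
```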

Step 6:

Below is a detailed explanation of the columns:

Column Name	Pre-Populated	Description
Department/MainShare	No – Filled by QA Analyst	The department or share for which the scan is done. E.g., for the path /mnt/Extract$/Central Operations/, the QA analyst fills in ClientName-Extract, where Extract is the share/department.
Path	Yes	Complete path to where the document is stored.
Category Name	Yes	The category name that is assigned to the record. E.g., HR Data.
SubCategory	Yes	The subcategory that is assigned to the record. E.g., Timesheet.
File Checked Y/N	No – Filled by Analyst	Y if the analyst has validated the record, else N.
File Value changed Y/N	No – Filled by Analyst	Y if the analyst has changed either the category or the subcategory of the record, else N.
Validate Date	No – Filled by Analyst	Date on which the record was validated.
Previewed (Y/N)	No – Filled by Analyst	Y if the analyst has seen the original document, else N.
Contains PII	Yes	t (True) if the record contains any form of Personally Identifiable Information (PII), else f (False).
FileType	Yes	File type of the document.

Step 7:

After the basic clean-up is done, the file is ready for classification. The analyst goes through the records one by one and populates the columns. The task is to look at the Category and Subcategory of each record and validate whether the machine has classified the file correctly; if not, the analyst must change them.

Scenario 1: Preview is available on the screen below. When you click on the path link, the document will be visible.

Look at the file path below. Based on the file name alone, we cannot be sure whether Category Name: Business_Documents and Subcategory: Audit is right or wrong.

Use the preview option to view the file. If, after viewing it, you can confirm that the document is a business document related to Audit, no change is required to the Category and Subcategory columns.

Update File Checked: Y and File Value Changed: N.

But if, after looking at the document, you think that the machine has NOT assigned the right category, you can change it. In this example, I decided after viewing the document that it is an invoice, so I change it to Category Name: Finance_Documents and Subcategory: Invoice/Purchase Order.

Also update File Checked: Y and File Value Changed: Y.

Scenario 2: If Preview is NOT available:
- When the above screen is not available (for some reason), the analyst must use their best judgement based on what they can discern from the file name and location.
- Taking the same example as above, we can see that the file is stored in a folder of invoices, so I would change the category and subcategory to reflect an invoice.
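The bookkeeping that follows each decision in both scenarios can be sketched as one small function. This is an illustrative sketch only: record_review is a hypothetical helper, the row is assumed to be a dict keyed by the column names from Step 5, and the ISO date format (YYYY-MM-DD) for Validate Date is an assumption, since no date format is prescribed here.

```python
from datetime import date


def record_review(row, category, subcategory, previewed, today=None):
    """Return a copy of the row with the analyst's decision recorded."""
    # The value counts as changed if either the category or the
    # subcategory differs from what the machine assigned.
    changed = (category, subcategory) != (row["Category Name"], row["SubCategory"])

    updated = dict(row)
    updated["Category Name"] = category
    updated["SubCategory"] = subcategory
    updated["File Checked Y/N"] = "Y"
    updated["File Value changed Y/N"] = "Y" if changed else "N"
    updated["Previewed (Y/N)"] = "Y" if previewed else "N"
    # ISO format is an assumption; use whatever format the team expects.
    updated["Validate Date"] = (today or date.today()).isoformat()
    return updated
```

For the invoice example, calling record_review(row, "Finance_Documents", "Invoice/Purchase Order", previewed=True) marks the record as checked and changed; passing the original category and subcategory marks it as checked and unchanged.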

It is important to remain consistent in your own categorization decisions. If the analyst marks one file as Note/Log, they should always categorize similar files as Note/Log. This is more important than being perfectly accurate.
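One rough way to spot inconsistent decisions is to look for folders whose checked files ended up with more than one category/subcategory pair. This sketch treats "similar files" as "files in the same folder", which is an assumption made only for illustration; mixed_folders is a hypothetical helper, and a flagged folder is a prompt to recheck, not proof of an error.

```python
import posixpath
from collections import defaultdict


def mixed_folders(rows):
    """Map each folder to its labels when checked files there disagree."""
    labels_by_folder = defaultdict(set)
    for row in rows:
        # Only records the analyst has already validated are compared.
        if row.get("File Checked Y/N") == "Y":
            folder = posixpath.dirname(row["Path"])
            labels_by_folder[folder].add((row["Category Name"], row["SubCategory"]))
    return {
        folder: labels
        for folder, labels in labels_by_folder.items()
        if len(labels) > 1
    }
```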

Step 8:

After the above steps are completed, send the updated CSV file to the Getvisibility team, who will use the inputs to train the model for better accuracy and overall results.
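If the reviewed records are held in a script instead of Excel, writing them back to CSV can be sketched as below. The sketch assumes rows are dicts that all share the same keys, and it writes plain UTF-8; whether the receiving team expects a byte-order mark or a particular delimiter is not stated here, so confirm that before sending. save_review_csv is a hypothetical helper name.

```python
import csv


def save_review_csv(rows, path):
    """Write the reviewed rows to a UTF-8 CSV with a header row."""
    if not rows:
        raise ValueError("no rows to write")
    # The header order follows the key order of the first row.
    fieldnames = list(rows[0].keys())
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```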

Method 2: Online Method

Use the Portal to train the model.

Navigation: Log in to the application > Landing > Verify

After you click Verify, the screen below for Manual Classification of Files will be displayed:

Click on the path link to view the document.
Based on the document content, change the Category, Subcategory, or PII value.
Click on the Update button.
The new values will be stored and can be used for training the ML model.
