Cleaning Up HTML with Lua

Recently, I had to strip some HTML entities out of an RSS feed to make it readable. Since there is no built-in way to remove HTML entities in Lua, I built a quick table that contains some of the more commonly used entities and tags. I used this table in conjunction with string.gsub() to replace each of them in a string. This approach is best suited to smaller strings or snippets of a paragraph.

Here’s the code:

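local myText = "<p>This is some &lt; stuff &amp; things here.</p>"

-- Each entry is { Lua pattern to look for, replacement text }.
local htmlToRemove = {
	{ "&nbsp;", " " },
	{ "&lt;", "<" },
	{ "&gt;", ">" },
	{ "&amp;", "&" },
	{ "<p>", "" },
	{ "</p>", "" },
	-- Matches <br>, <br/> and <br />. A greedy ".*" here would swallow
	-- everything up to the last "/>" in the string.
	{ "<br%s*/?>", "\n" },
}

-- Apply every replacement in order; only the first return value of gsub is kept.
for i = 1, #htmlToRemove do
	local entry = htmlToRemove[i]
	myText = string.gsub(myText, entry[1], entry[2])
end

print(myText)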

And the print statement will output this: This is some < stuff & things here.

The for loop iterates over each entry in htmlToRemove and replaces every match of the first element with the second element. For example, the first item that gets replaced is the nbsp with a blank space (I left off the ampersand and semi-colon so it would appear correctly here).

If you’d like to remove more HTML entities, you’ll need to add them to the htmlToRemove table. Put the characters to look for in the first element and what to replace them with in the second. Keep in mind that string.gsub() treats the first element as a Lua pattern, so characters such as . - and % must be escaped with a % to match literally.

That’s all there is to removing HTML entities with Lua! If you have questions, please leave them below, and thanks for reading!
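For anyone who wants to reuse this, here is a minimal sketch of the same idea wrapped in a function. The names cleanHtml, escapePattern and literalReplacements are my own for illustration and are not part of Lua. It assumes you want tags stripped before entities are decoded (so escaped text like &lt;p&gt; survives as a literal <p>), and it escapes each search string so pattern characters are matched literally.

```lua
-- Hypothetical helper: escape Lua pattern magic characters so the
-- string is matched literally by string.gsub.
local function escapePattern(s)
	return (s:gsub("[%^%$%(%)%%%.%[%]%*%+%-%?]", "%%%0"))
end

-- Literal search strings and their replacements (illustrative list).
local literalReplacements = {
	{ "&nbsp;", " " },
	{ "&quot;", '"' },
	{ "&#39;", "'" },
	{ "&lt;", "<" },
	{ "&gt;", ">" },
	{ "&amp;", "&" }, -- kept last so "&amp;lt;" becomes "&lt;", not "<"
}

-- Hypothetical wrapper: strip a few simple tags, then decode entities.
local function cleanHtml(text)
	text = text:gsub("<br%s*/?>", "\n") -- <br>, <br/>, <br />
	text = text:gsub("</?p>", "")       -- <p> and </p>
	for _, pair in ipairs(literalReplacements) do
		text = text:gsub(escapePattern(pair[1]), pair[2])
	end
	return text
end

print(cleanHtml("<p>Tom &amp; Jerry&#39;s &quot;show&quot;<br/>Next line</p>"))
```

The replacement strings here contain no % characters. If yours do, they would need to be doubled (%%), since string.gsub() also treats % specially in the replacement.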

Daniel Williams
