BFMracing

General Category => General Board => Homework Haven => Topic started by: WibblyWib on August 29, 2010, 06:09:05 AM

Title: Removing HTML tags
Post by: WibblyWib on August 29, 2010, 06:09:05 AM
So I've got a bit of my website where users submit comments, which are then displayed in a div. I want to remove ALL code which they might have placed in the submission, anything from a line break to embedded JavaScript. From a security perspective, would it be sufficient to simply have a PHP function which scans the string and locates and removes any <s, >s, or &s?

I know there is a strip_tags function in PHP, but I would rather only remove the above characters in case the text wasn't intended to be an HTML tag, and apparently strip_tags can get confused by certain symbols...
 
Anyone in-the-know on this kinda thing?

:) wib
Title: Re: Removing HTML tags
Post by: MrMxyzptlk on August 29, 2010, 02:01:56 PM

No, that would not be sufficient.  (And no, I have never specifically done that kind of thing before....)

It seems to me the simplest way would be to disallow any non-alphanumeric keystroke input other than the few semi-standard language ones. (E.g.: ! ( ) - : ; " ' , . ? )

This, of course, would only apply to English style languages tho.

Masking keystroke input is very easy and straightforward in any web/programming language.  Here's an in-line javascript example function that limits keyboard input to numbers:


Code:
<script type="text/javascript"><!--

// Allow only digit keystrokes; control characters (codes 0-31) pass through.
function onlyNumbers(evt)
{
    evt = evt || window.event;
    var charCode = evt.charCode || evt.keyCode || evt.which || 0;
    if (charCode > 31 && (charCode < 48 || charCode > 57))
    {
        // alert("Please enter numerals only in this field.");
        return false;
    }
    return true;
}

 --></script>
(Note: You'll need to alter the code snippet above to include all the characters that you want to allow....)

It can be invoked in HTML via the following directive:

Code:
<input type="text" onkeypress="return onlyNumbers(event)">
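
As a rough sketch of the alteration mentioned above, the numeric range check could be swapped for an allowlist test. The name isAllowedChar and the exact character set are assumptions for illustration (letters, digits, space, and the punctuation listed earlier in this post). Keystroke masking runs in the browser and can be bypassed, for instance by pasting text or sending the request directly, so the same check would still need to be repeated on the server.

```javascript
// Hypothetical helper: decides whether a key code is on the allowlist.
// The allowed set is an assumption based on the punctuation listed earlier.
var ALLOWED = /^[A-Za-z0-9 !()\-:;"',.?]$/;

function isAllowedChar(charCode) {
    // Control codes (backspace, tab, enter...) pass through, as in onlyNumbers.
    if (charCode <= 31) {
        return true;
    }
    return ALLOWED.test(String.fromCharCode(charCode));
}

console.log(isAllowedChar(65)); // 'A' -> true
console.log(isAllowedChar(60)); // '<' -> false
```

Inside onlyNumbers, the range test could then become something like `if (!isAllowedChar(charCode)) return false;`.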


Title: Re: Removing HTML tags
Post by: BFM_Kiwi on August 29, 2010, 05:19:41 PM
I have tried this before (not with PHP though), and I would recommend you NOT try to write something that looks at specific characters or patterns. If you are looking for something robust and safe, you will probably fail! Basically, what you are trying to write is a small parser (or compiler), and that is not an easy thing to do. If someone is malicious, they'll probably be able to get around your attempt to block them.

Rather than removing the characters, the best thing to do is simply encode the characters so that while they will display to the user, they will not be interpreted as HTML characters.

Most languages like php will have an html encode/decode.  Or in javascript you use escape (to encode) and unescape (to convert back to html).

So you would just say

var encodedHTML = escape( user_input_string )


See this link.  It has some javascript code and a live demo of it where you can type html tags and it will encode (escape) your text safely.  

http://www.yuki-onna.co.uk/html/encode.html
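
To make the encoding idea concrete, here is a minimal sketch in JavaScript. Note that it uses HTML entity encoding rather than escape(), which produces URL-style percent-encoding instead. escapeHtml is a hypothetical helper name, not a built-in function.

```javascript
// Hypothetical helper (not a built-in): replaces the five characters that
// are significant in HTML with their entity equivalents.
function escapeHtml(text) {
    var entities = {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;'
    };
    return String(text).replace(/[&<>"']/g, function (ch) {
        return entities[ch];
    });
}

console.log(escapeHtml('<script>alert("hi")</script>'));
// prints: &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```

The browser renders the result as the literal text the user typed, without treating any of it as markup.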

Title: Re: Removing HTML tags
Post by: WibblyWib on August 30, 2010, 09:44:04 AM
Thanks for your replies MrMxyzptlk, BFM_Kiwi.

Kiwi, I don't quite follow what the end result of your suggestion would be.

Once the string is escaped, and looks like this 'hello%3Cbr%3Enew%20line', then what...?

I could then replace all instances of %xx with a SPACE, for example? (or indeed selecting specific symbols to replace)

Or....?

I don't see how the escape function enables you to display html characters without executing them.

Also, what characters are unsafe other than & < > " ' ? PHP functions such as htmlspecialchars (http://php.net/manual/en/function.htmlspecialchars.php) only deal with these...
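
For comparison, here is a short sketch of what percent-encoding actually produces. It uses the standard encodeURIComponent and decodeURIComponent functions in place of the older escape and unescape, and the variable names are made up for the example. The encoded form is meant for URLs: a browser shows the %xx sequences literally rather than as the original characters, and decoding brings the markup back unchanged.

```javascript
var userInput = 'hello<br>new line';

// Percent-encoding: intended for URLs, not for display inside HTML.
var percentEncoded = encodeURIComponent(userInput);
console.log(percentEncoded); // prints: hello%3Cbr%3Enew%20line

// Decoding restores the original markup, so nothing has been neutralised.
var roundTrip = decodeURIComponent(percentEncoded);
console.log(roundTrip === userInput); // prints: true
```

For plain text placed inside an element such as a div, converting those characters to entities, as htmlspecialchars does (with ENT_QUOTES if single quotes should be covered too), is the usual approach.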
Title: Re: Removing HTML tags
Post by: WibblyWib on August 31, 2010, 03:20:52 PM
Following further research, I wrote/converted a bit of PHP to convert each character to its ASCII numeric character reference (e.g. &#97; for 'a'), which can be printed as-is and will display nicely in a browser.

Seems like a good solution to me (though I'm no hacker...).

Here's the code (if anyone is interested!). Any characters which are not on the list get turned into SPACEs.

Code:
<?php

$char_array['a'] = "&#97;";
$char_array['b'] = "&#98;";
$char_array['c'] = "&#99;";
$char_array['d'] = "&#100;";
...... 
//here some more verboseness (letters and numbers)
$char_array[' '] = "&#32;";
$char_array['!'] = "&#33;";
$char_array['"'] = "&#34;";
$char_array['#'] = "&#35;";
$char_array['$'] = "&#36;";
$char_array['%'] = "&#37;";
$char_array['&'] = "&#38;";
$char_array["'"] = "&#39;";
$char_array['('] = "&#40;";
$char_array[')'] = "&#41;";
$char_array['*'] = "&#42;";
$char_array['+'] = "&#43;";
$char_array[','] = "&#44;";
$char_array['-'] = "&#45;";
$char_array['.'] = "&#46;";
$char_array['/'] = "&#47;";
$char_array[':'] = "&#58;";
$char_array[';'] = "&#59;";
$char_array['<'] = "&#60;";
$char_array['='] = "&#61;";
$char_array['>'] = "&#62;";
$char_array['?'] = "&#63;";
$char_array['@'] = "&#64;";
$char_array['['] = "&#91;";
$char_array['\\'] = "&#92;"; // double backslash required, as a single \ would escape the closing ' character. Tested - still works
$char_array[']'] = "&#93;";
$char_array['^'] = "&#94;";
$char_array['_'] = "&#95;";
$char_array['`'] = "&#96;";
$char_array['{'] = "&#123;";
$char_array['|'] = "&#124;";
$char_array['}'] = "&#125;";
$char_array['~'] = "&#126;";

$input_string = "this the string~~;#'';}{}.;.*^&^&$)(*\© "; //test input string

// walk the string backwards so that replacements do not shift the indexes still to be visited
for($i = strlen($input_string) - 1; $i >= 0; $i--){

$char = substr($input_string, $i, 1); //get individual character
echo "<br>".$char.": "; //echo for posterity

$valid = array_key_exists($char, $char_array); //check if char is in array

if($valid == false){ // i.e. not in the list of accepted chars
$input_string = substr_replace($input_string, "&#32;", $i, 1); //replace char with SPACE
echo "invalid";
}
else{
$input_string = substr_replace($input_string, $char_array[$char], $i, 1); //replace char with ascii value
echo "valid";
}

}

echo "<br>".$input_string;

?>
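
The same idea can be sketched without the lookup table by computing each numeric character reference from the character code. This is a JavaScript illustration rather than the PHP above; toNumericRefs is a hypothetical name, and the printable ASCII range 32 to 126 is an assumption about what the full table covers.

```javascript
// Hypothetical helper: encodes printable ASCII as &#N; references and
// replaces every other character with an encoded space (&#32;).
function toNumericRefs(text) {
    var out = '';
    for (var i = 0; i < text.length; i++) {
        var code = text.charCodeAt(i);
        if (code >= 32 && code <= 126) {
            out += '&#' + code + ';';
        } else {
            out += '&#32;'; // not in the accepted range
        }
    }
    return out;
}

console.log(toNumericRefs('a<b'));
// prints: &#97;&#60;&#98;
```

Building a new string this way also avoids having to loop backwards, since nothing is replaced in place.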